Targeted and Whole-Genome Technologies to Profile DNA Cytosine Methylation

ABSTRACT

Methods and compositions for determining a methylated cytosine profile of a target nucleic acid sequence are provided.

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 61/162,913, filed on May 24, 2009, which is hereby incorporated herein by reference in its entirety for all purposes.

STATEMENT OF GOVERNMENT INTERESTS

This invention was made with government support under HG003170 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

1. Field of the Invention

Embodiments of the present invention relate in general to methods and compositions for profiling the methylation state of cytosine residues in a nucleic acid sample.

2. Description of Related Art

Cytosine methylation, an epigenetic modification of DNA, plays an important role in embryogenesis, cancer, and other human diseases (Goll, M. G. & Bestor, T. H., Eukaryotic cytosine methyltransferases. Annu Rev Biochem 74, 481-514 (2005); Suzuki, M. M. & Bird, A., DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet 9 (6), 465-476 (2008); Feinberg, A. P. & Tycko, B., The history of cancer epigenetics. Nat Rev Cancer 4 (2), 143-153 (2004); Jiang, Y. H., Bressler, J., & Beaudet, A. L., Epigenetics and human disease. Annu Rev Genomics Hum Genet 5, 479-510 (2004)). Although a variety of methods are available to study cytosine methylation, many are limited by insufficient throughput, low accuracy, or inherent biases (Suzuki, M. M. & Bird, A., DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet 9 (6), 465-476 (2008); Beck, S. & Rakyan, V. K., The methylome: approaches for global DNA methylation profiling. Trends Genet 24 (5), 231-237 (2008); Zilberman, D. & Henikoff, S., Genome-wide analysis of DNA methylation patterns. Development 134 (22), 3959-3965 (2007)).

SUMMARY

Accordingly, the present invention is based in part on the discovery of two new, complementary techniques for cytosine methylation profiling, bisulfite padlock probes (BSPPs) and methyl sensitive cut counting (MSCC), both of which utilize the power of next generation sequencing technology. In the first method, a set of ~10,000 BSPPs complementary to target DNA (e.g., ENCODE regions) was designed (Birney, E. et al., Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447 (7146), 799-816 (2007)). A pattern of low promoter methylation coupled with high gene body methylation was observed in highly expressed genes. Using the second method, MSCC, genome-wide data were gathered for 1.4 million HpaII sites and it was determined that gene body methylation in highly expressed genes was a consistent phenomenon over the entire genome. In addition, it was determined that expression-related differences in promoter methylation were larger outside of CpG islands than within, highlighting the usefulness of DNA methylation profiling technologies like BSPP and MSCC that are not strongly biased in favor of CpG islands.

Accordingly, a method for determining a methylated cytosine profile of a target nucleic acid sequence is provided. In certain exemplary embodiments, the method includes the steps of providing a sample of nucleic acid sequences, contacting the sample with a chemical agent (e.g., bisulfite) to convert unmethylated cytosine residues in the nucleic acid sequences to uracil residues, contacting the sample with a plurality of nucleic acid probes, wherein the probes are designed to hybridize randomly along a target nucleic acid sequence, allowing hybridization of the plurality of nucleic acid probes to the target nucleic acid sequence, forming a plurality of circular nucleic acid sequences, each of the circular sequences comprising a nucleic acid probe sequence and a target nucleic acid sequence, amplifying the plurality of circular nucleic acid sequences to form a plurality of amplified target nucleic acid sequences, and sequencing the amplified target nucleic acid sequences. In certain aspects, probes (e.g., padlock probes) are designed to hybridize to promoter regions along a target nucleic acid sequence. In other aspects, amplification primers hybridize to nucleic acid probe sequences during the step of amplifying. In still other aspects, the target nucleic acid sequence is any combination of genomic DNA (e.g., whole genome DNA), one or more genes, and one or more promoter regions.
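The read-out principle behind the bisulfite step can be illustrated with a minimal in-silico sketch. It assumes, for illustration only, that unmethylated cytosines are read as T after conversion and amplification, that methylated cytosines remain C, and that a read is already aligned base-for-base to its reference. The function names `bisulfite_convert` and `call_methylation` are hypothetical and are not part of the disclosed method.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate the sequence read after bisulfite treatment and amplification.

    Assumption for this sketch: every unmethylated C is converted to U and
    read as T; a C at a 0-based index in `methylated_positions` is protected.
    """
    converted = []
    for index, base in enumerate(seq.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Call the state of each reference cytosine from one aligned read.

    Returns {position: True/False}; True means the read still shows C
    (methylated), False means it shows T (unmethylated). Other bases at a
    reference C are treated as uninformative and skipped.
    """
    calls = {}
    for index, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[index] = True
        elif read_base == "T":
            calls[index] = False
    return calls


if __name__ == "__main__":
    reference = "ACGTCG"
    read = bisulfite_convert(reference, methylated_positions={1})
    print(read, call_methylation(reference, read))
```

Only the top strand is modeled here; a practical pipeline would also need to handle the complementary strand and sequencing errors.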

In other exemplary embodiments, the method includes the steps of providing a sample of nucleic acid sequences, cleaving the nucleic acid sequences in a methylation-dependent manner to generate a plurality of cleaved target nucleic acid sequences, ligating first adapter sequence tags to the 5′ ends of cleaved target nucleic acid sequences and second adapter sequence tags to the 3′ ends of the cleaved target nucleic acid sequences, amplifying the cleaved target nucleic acid sequences having first and second adapter sequence tags ligated thereto, and sequencing the amplified, cleaved target nucleic acid sequences. In certain aspects, the step of cleaving the nucleic acid sequences in a methylation-dependent manner comprises contacting the nucleic acid sequences with a methyl sensitive restriction enzyme to cleave unmethylated CpG dinucleotide sequences. In other aspects, amplification primers hybridize to the first or the second adapter sequence tags during the step of amplifying. In other aspects, the target nucleic acid sequence is any combination of genomic DNA (e.g., whole genome DNA), one or more genes, and one or more promoter regions. In certain aspects, the method further includes the step of comparing the methylated cytosine profile of the target nucleic acid sequence to a methylated cytosine profile of a control library, e.g., a control library generated by contacting a target nucleic acid sequence with a methylation-insensitive enzyme (e.g., MspI).

In other exemplary embodiments, a method for determining a complementary methylated cytosine library of a target nucleic acid sequence is provided. The method includes the steps of providing a sample of nucleic acid sequences, cleaving the nucleic acid sequences in a methylation-dependent manner to generate a plurality of cleaved target nucleic acid sequences, blocking the ends of the cleaved target nucleic acid sequences to prevent the cleaved target nucleic acid sequences from contributing to library construction, and contacting the blocked, cleaved target nucleic acid sequences with a methylation-insensitive enzyme to create a complementary methylated cytosine library that comprises a plurality of nucleic acid sequences that were not cleaved in a methylation-dependent manner. In certain aspects, the blocking step includes dephosphorylating the 5′ ends of the cleaved target nucleic acid sequences.

Further features and advantages of certain embodiments of the present invention will become more fully apparent in the following description of the embodiments and drawings thereof, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. The foregoing and other features and advantages of the present invention will be more fully understood from the following detailed description of illustrative embodiments taken in conjunction with the accompanying drawings in which:

FIGS. 1A-1C depict bisulfite padlock probe (BSPP) technology enabling accurate measurement of methylation levels. (A) A BSPP experimental scheme. Two hybridizing locus-specific arms (blue) were connected by a 50 base pair common “backbone” sequence (green). Approximately 10,000 BSPPs were designed to target CpG sites in bisulfite-treated DNA with a CpG located at the 3′ end of the polymerized span (red). Circles were formed by addition of polymerase, dNTP, and ligase, and were subsequently amplified using the backbone sequence as primers. Sequencing was then performed using an Illumina Genome Analyzer with a primer matching the backbone sequence. 28 bases of arm sequence were read through before sequencing informative positions within the span (read lengths were 36 bases in total). (B) Correlation of methylation level in the technical replicates (Pearson coefficient r=0.965). (C) Correlation of BSPP methylation with the methylation levels determined by bisulfite PCR followed by Sanger sequencing at 33 locations (r=0.966).

FIGS. 2A-2F graphically depict methylation versus gene positions, split by gene expression level. (A) Running median methylation vs. gene position for high and low expression genes in ENCODE regions of the GM06990 cell line (based on BSPP data). (B)-(F) were based on methyl sensitive cut counting (MSCC) data and share the same key. (B) Running average methylation vs. gene position for all genes in the PGP1 EBV-transformed lymphoblastoid cell line, and split into five groups based on expression level. Contribution of each MSCC data point was normalized for local CpG density, MspI control counts and, for sites within the gene, for gene length. (C) Running average methylation vs. position relative to transcription start site (TSS) (MSCC). The methylation pattern appeared to have two valleys on either side of the TSS. This was very similar to the pattern of H3K4 methylation in promoter regions (Barski, A. et al., High-resolution profiling of histone methylations in the human genome. Cell 129 (4), 823-837 (2007)). (D) Running average methylation vs. position relative to transcriptional end of genes (for genes at least 15 kb in length). (E) Running average methylation vs. position relative to transcription start site for locations within CpG islands. (F) Running average methylation vs. position relative to transcription start site for locations outside CpG islands.

FIGS. 3A-3B depict MSCC technology allowing accurate estimation of methylation levels. (A) Scheme of generation of a methyl sensitive cut site library. (1) HpaII digestion cuts genomic DNA at unmethylated CCGG sites only. (2) The first adapter containing an MmeI recognition site was ligated. (3) MmeI digestion cut into the unknown genomic sequence to produce an 18-19 base pair tag. (4) A second adapter was added by ligation. (5) The library was amplified and sequenced. The number of reads for a given site was correlated with the amount of digestion that occurred there and was thus an indication of methylation level. (B) BSPP methylation vs. MSCC counts. Data were grouped into 22 equal bins according to the BSPP-determined methylation levels. The average number of counts (black points) was linearly related to the average methylation of a bin (blue best fit line is shown). The standard deviation for each data point was also shown (horizontal green), along with the standard deviation for a Poisson distribution with the average value (horizontal blue).
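The counting idea in the MSCC scheme can be sketched in a few lines: enumerate the tags expected on either side of each CCGG site in a reference sequence, then tally sequenced tags per site. This is a simplified illustration, not the disclosed protocol. It assumes a fixed tag length (18 bases by default), assumes each tag begins at the HpaII cut (C^CGG) and is reported 5′ to 3′ on its own strand, and uses hypothetical helper names (`ccgg_tags`, `count_reads_per_site`).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ccgg_tags(genome, tag_len=18):
    """Map each expected tag to (site_position, side) for every CCGG site.

    Assumptions for this sketch: the downstream tag starts at the base after
    the cut on the top strand; the upstream tag is the mirror image on the
    bottom strand. Tags that would run off the sequence are skipped, and a
    tag occurring at more than one site simply keeps the last site seen (a
    real analysis would more likely discard non-unique tags).
    """
    tags = {}
    start = genome.find("CCGG")
    while start != -1:
        downstream = genome[start + 1:start + 1 + tag_len]
        if len(downstream) == tag_len:
            tags[downstream] = (start, "+")
        upstream_start = start + 3 - tag_len
        if upstream_start >= 0:
            upstream = reverse_complement(genome[upstream_start:start + 3])
            tags[upstream] = (start, "-")
        start = genome.find("CCGG", start + 1)
    return tags


def count_reads_per_site(reads, tags):
    """Sum read counts per CCGG site, combining both tags of a site."""
    counts = Counter()
    for read in reads:
        hit = tags.get(read)
        if hit is not None:
            counts[hit[0]] += 1
    return counts


if __name__ == "__main__":
    genome = "GATTACGCCGGTCAGTTT"
    tags = ccgg_tags(genome, tag_len=4)
    print(tags)
    print(count_reads_per_site(["CGGT", "CGGT", "CGGC", "AAAA"], tags))
```

Under this scheme a higher count at a site suggests more HpaII digestion and therefore lower methylation, which is why the counts are inversely related to the BSPP-determined methylation levels.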

FIGS. 4A-4B depict methylation profiles of individual genes. (A) Individual genes were plotted according to the average MSCC counts found in the promoters (horizontal axis, −400 to +1000 relative to start) and gene bodies (vertical axis, between the gene end and +2000 relative to start). The color of each point reflects the expression level of that gene. Only data points from outside CpG islands and only genes with at least 5 data points in each region were used. (B) A histogram of average gene body methylation appeared to be bimodally distributed (gray). When split into two equally sized groups based on gene expression, high expression genes (red) matched the right peak and low expression genes (blue) matched the left peak. Only genes with at least 50 data points in the gene body region were used.

FIGS. 5A-5D depict BSPP capturing and correlation of probe observations for GM06990. (A) Padlock probes were expected to give rise to library molecules of 155 base pairs (arrow) after amplification. This band was purified and used for Sanger and Illumina Genome Analyzer sequencing. The high band (271 base pairs) was the result of amplification products produced by polymerization making an extra trip around the circularized molecule. The low band (approximately 45 base pairs) was derived from primers. M: 25 base pair DNA ladder (Invitrogen). 1 and 2: two technical replicates. The same patterns of DNA bands were observed for other samples (PGP1L, PGP9L, PGP9F, and PGP9 iPS). (B) A histogram of the number of reads observed for each probe. Probes were sorted according to the number of reads observed. (C) The numbers of reads for individual probes were highly correlated between technical replicates (Pearson correlation r=0.956, Spearman ranked correlation ρ=0.968). (D) A histogram of the number of reads observed for each probe in each run. Probes were sorted according to the number of reads observed.

FIGS. 6A-6H depict histograms of CpG methylation for BSPP data in GM06990 and PGP cell lines. (A) GM06990 EBV-transformed B-lymphocytes. (B) PGP1 EBV-transformed B-lymphocytes. (C) PGP9 EBV-transformed B-lymphocytes. (D) PGP9 fibroblasts. (E) PGP9 fibroblasts. (F) PGP1 induced pluripotent cells. (G) PGP9 induced pluripotent cells, clone 1. (H) PGP9 induced pluripotent cells, clone 2.

FIG. 7 depicts a histogram of correlations for methylation state of same-strand CpG pairs. For probes capturing more than a single CpG, the subset of sites for which both CpGs had intermediate methylation levels was taken (between 20% and 80% and at least 100 total reads), and the correlation of methylation state on individual strands for each pair of sites was determined. Sites were generally positively correlated with a coefficient of around 0.5. A mixture entirely of “C & C” and “T & T” haplotypes would be perfectly correlated, with a coefficient of 1; a mixture of “C & T” and “T & C” would be perfectly anti-correlated with a coefficient of −1; a random mixture would give rise to a coefficient of 0.
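The limiting cases described for FIG. 7 can be checked with a short sketch that encodes each read as a two-letter haplotype ("C" for a methylated call, "T" for an unmethylated call at each of the two CpGs) and computes the Pearson coefficient over reads. The encoding and the function name `pair_correlation` are assumptions made for illustration; the source does not specify how the coefficient was computed.

```python
import math


def pair_correlation(haplotypes):
    """Pearson correlation of methylation calls at two CpGs across reads.

    `haplotypes` is a sequence of 2-character strings such as "CC" or "CT",
    where "C" marks a methylated call and "T" an unmethylated call.
    Returns None when either site shows no variation (correlation undefined).
    """
    first = [1 if h[0] == "C" else 0 for h in haplotypes]
    second = [1 if h[1] == "C" else 0 for h in haplotypes]
    n = len(first)
    if n == 0:
        return None
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    covariance = sum((x - mean_first) * (y - mean_second)
                     for x, y in zip(first, second))
    var_first = sum((x - mean_first) ** 2 for x in first)
    var_second = sum((y - mean_second) ** 2 for y in second)
    if var_first == 0 or var_second == 0:
        return None
    return covariance / math.sqrt(var_first * var_second)


if __name__ == "__main__":
    print(pair_correlation(["CC", "TT", "CC", "TT"]))   # concordant haplotypes
    print(pair_correlation(["CT", "TC"]))               # discordant haplotypes
    print(pair_correlation(["CC", "CT", "TC", "TT"]))   # all four equally mixed
```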

FIGS. 8A-8D depict correlations of BSPP methylation data with chromatin immunoprecipitation data. Using chromatin immunoprecipitation (ChIP) data from the ENCODE project produced for histone modifications in the GM06990 cell line, methylation measurements for individual CpGs were compared with the ChIP scores at those locations. It was determined that (A) methylation was positively correlated with H3K36me3, and that (B) methylation was negatively correlated with H3K27me3. These data are consistent with how these histone modifications are distributed in expressed versus inactive genes. (C) H3K36me3 was high in the gene body of highly expressed genes, and so it was positively correlated with the observation of high methylation in highly expressed genes. (D) H3K27me3 was high in the gene body of inactive genes, and so it was negatively correlated.

FIGS. 9A-9D depict a comparison of methylation levels at individual sites between PGP cell lines. Using the methylation levels gathered with bisulfite padlock probes, methylation at individual sites between cell lines was compared. (A) PGP1 EBV-transformed B-lymphocyte vs. PGP9 EBV-transformed B-lymphocyte (Pearson correlation r=0.85, Spearman ranked correlation ρ=0.87). (B) PGP9 EBV-transformed B-lymphocyte vs. PGP9 fibroblast (Pearson correlation r=0.63, Spearman ranked correlation ρ=0.63). (C) PGP9 fibroblast vs. PGP9 induced pluripotent clone 1 (Pearson correlation r=0.46, Spearman ranked correlation ρ=0.45). (D) PGP9 fibroblast vs. PGP9 induced pluripotent clone 2 (Pearson correlation r=0.46, Spearman ranked correlation ρ=0.45).

FIGS. 10A-10G depict methylation versus position for PGP cell lines using BSPP data. Running median methylation of high expression and low expression genes within the ENCODE regions of PGP cell lines, based on BSPP data. Although these cell lines had different amounts of genomic methylation (see FIG. 6 for histograms), there was a consistent pattern: high expression genes had low promoter methylation coupled with high gene body methylation; low expression genes had a constant methylation throughout that varied depending on the overall levels of methylation in the sample. All panels share the same key. (A) PGP1 lymphocyte. (B) PGP9 lymphocyte. (C) PGP1 fibroblast. (D) PGP9 fibroblast. (E) PGP1 induced pluripotent cells. (F) PGP9 induced pluripotent cells (clone 1). (G) PGP9 induced pluripotent cells (clone 2).

FIGS. 11A-11C depict histograms of the number of counts for MSCC data. (A) Histogram of number of sites for each MSCC HpaII counts value in each replicate. (B) Histogram of number of sites for each MSCC MspI control counts value. (C) Two dimensional histogram showing the correlation between counts from MSCC HpaII replicate 1 and replicate 2 (r=0.818).

FIGS. 12A-12B depict a prediction of methylation for individual MSCC locations for a given number of counts. Horizontal bars denote the median methylation for a given range of counts, boxes mark the 25th and 75th percentiles, and whiskers mark the 5th and 95th percentiles. (A) Methylation levels for individual tags. (B) Methylation levels for combined tag counts. Some sites have data for both possible tags; these can be added together to create a more accurate methylation prediction.

FIG. 13 depicts promoter versus gene body for individual genes, using all data points (based on MSCC data). This figure is plotted the same way as FIG. 4, except that all data points (both inside and outside CpG islands) were used. Gene promoter methylation is the horizontal axis, gene body methylation is the vertical axis, and color reflects gene expression rank. Only genes with at least ten data points in each region were used.

FIGS. 14A-14C graphically depict the effects of CpG density and methylation in genes with different levels of expression. (A) High CpG promoters (65% of all promoters) tend to have little methylation regardless of expression. (B) Intermediate CpG promoters (16% of promoters) tend to have low levels of methylation in highly expressed genes and high levels of methylation in weakly expressed genes. (C) Low CpG promoters (28% of promoters) tend to be highly methylated regardless of gene expression.

FIGS. 15A-15B depict individual sites comparison of BSPP methylation vs. MSCC HpaII counts for single and paired tags. There were a total of 381 sites and, of those, 345 had MSCC data for both tags (“paired”) for a total of 726 tags. (A) The plot of individual tag counts vs. BSPP methylation for the 726 individual tags. (B) A plot of combined MSCC tag counts for the 345 sites with paired tags showed that the data became more accurate for these sites. The sum of paired tag counts (B) had a stronger correlation to methylation (B: r=−0.73, ρ=−0.79) than individual tags (A: r=−0.63, ρ=−0.70). Of the 1.4 million MSCC sites, most (888k, 63%) had paired tags.

FIGS. 16A-16D depict “inverse library” results. Preliminary results were obtained with an “inverse library” of tags derived from methylated CCGG sites. The library was constructed by dephosphorylating a HpaII digest, blocking the cleaved ends from ligation. The DNA was then cut at the remaining CCGG sites with the methylation-insensitive isoschizomer MspI and a library was constructed from these ends as before. With an inverse library, absolute methylation estimates could be made in the following manner: Based on the estimated average of 1.7 inverse library counts per 100% methylated site (A) and the estimated average of 8.9 MSCC HpaII library counts per 0% methylated site (FIG. 3B), inverse library counts are normalized to HpaII counts by multiplying by 5.2. Then, for each site: Normalized sum counts = normalized inverse library counts + HpaII library counts. Then, using only sites with a normalized sum of at least 7, Estimated methylation = (normalized inverse library counts)/(normalized sum counts). (A) “Inverse library” single tag counts vs. methylation as determined by BSPP. These were positively correlated with methylation (r=0.30, ρ=0.31). When data were averaged in 20 bins as per FIG. 3B, a linear fit of f(x)=a*x to the average values found a value of a=0.58, indicating that tags from fully methylated sites produce an average of approximately 1.7 counts. (B) “Inverse library” combined tag counts for paired sites vs. BSPP methylation. As with the original MSCC library, these were more strongly correlated with methylation (r=0.36, ρ=0.38). (C) Estimated methylation based on combined HpaII and inverse library counts for single tags with a normalized counts sum of at least 7 (r=0.77, ρ=0.78). (D) Averaged estimated methylation for paired tag locations where both tags had a normalized counts sum of at least 7 (r=0.85, ρ=0.87).
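The normalization arithmetic described for FIG. 16 can be written out as a small sketch. The scale factor of 5.2 (approximately 8.9/1.7) and the minimum normalized sum of 7 are taken from the description above; the function name `estimate_methylation` and its parameter names are hypothetical.

```python
# Values stated in the description of FIG. 16.
INVERSE_TO_HPAII_SCALE = 5.2   # approx. 8.9 HpaII counts / 1.7 inverse counts
MIN_NORMALIZED_SUM = 7.0       # sites below this sum are not estimated


def estimate_methylation(inverse_counts, hpaii_counts,
                         scale=INVERSE_TO_HPAII_SCALE,
                         min_sum=MIN_NORMALIZED_SUM):
    """Estimate the methylated fraction at one site from the two libraries.

    normalized inverse = inverse_counts * scale
    normalized sum     = normalized inverse + hpaii_counts
    estimate           = normalized inverse / normalized sum
    Returns None when the normalized sum is below `min_sum`.
    """
    normalized_inverse = inverse_counts * scale
    normalized_sum = normalized_inverse + hpaii_counts
    if normalized_sum < min_sum:
        return None
    return normalized_inverse / normalized_sum


if __name__ == "__main__":
    for inverse, hpaii in [(2, 0), (0, 9), (1, 5.2), (1, 0)]:
        print(inverse, hpaii, estimate_methylation(inverse, hpaii))
```

For example, a site with 2 inverse library counts and no HpaII counts has a normalized sum of 10.4 and an estimate of 1.0, whereas a site with 1 inverse count and no HpaII counts has a normalized sum of 5.2 and is excluded.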

FIG. 17 depicts estimates of increased MSCC accuracy with more sequencing reads. The probabilities over 70, 80 and 90% are highlighted in light green, yellow and red, respectively.

DETAILED DESCRIPTION

The principles of the present invention may be applied with particular advantage in methods of detecting nucleic acid (e.g., DNA) methylation patterns and changes in methylation patterns in nucleic acid sequences such as, e.g., one or more genes or a whole genome. In certain exemplary embodiments, methods and compositions to detect nucleic acid methylation relating to genome instability that leads to a disease state(s) or a change in general health are provided. As used herein, the terms “methylation biomarker,” “disease-specific methylated restriction site pattern” and “methylation fingerprint” refer to any sequence of nucleotides, such as CpG rich regions, where the 5′ position of any cytosine base becomes methylated. These regions may be found in any nucleotide sequence including, but not limited to, promoters, regulatory elements, enhancers, and gene coding sequences. Changes in any methylation fingerprint may be an indicator of genome instability and may be useful in the diagnosis of disease. For example, changes in a methylation fingerprint may alter the accessibility of the DNA to DNA binding proteins.

As used herein, a “nucleic acid target region” refers to a nucleic acid sequence that is examined using the methods disclosed herein. A nucleic acid target region includes whole-genome DNA, a segment of genomic DNA (e.g., a gene, a promoter region and the like), whole mitochondrial DNA, a segment of mitochondrial DNA and the like. In the context of methods for phenotype identification, the invention provides methods for identifying the methylation state of a nucleic acid target gene region and/or the methylation state of a nucleotide locus. A nucleic acid target gene region can also refer to an amplified product of a nucleic acid target gene region, including an amplified product of a treated nucleic acid target gene region, where the nucleotide sequence of such an amplified product reflects the methylation state of the nucleic acid target gene region. One skilled in the art would recognize that the size or length of the nucleic acid target gene region may vary depending on the limitation, or limitations, of the equipment used to perform the analysis. The nucleic acid target gene region may comprise more than one gene of interest, at least one gene of interest, a portion of a gene of interest, a promoter of a gene of interest or any combination of these. Correspondingly, a sequential or non-sequential series of nucleic acid target gene regions may be analyzed and exploited to map an entire gene or genome. The intended target will be clear from the context or will be specified.

As used herein, the “methylation state” of a nucleic acid target region refers to the presence or absence of one or more methylated nucleotide bases or the ratio of methylated cytosine to unmethylated cytosine for a methylation site in a nucleic acid target region. For example, a nucleic acid target region containing at least one methylated cytosine is considered methylated (i.e., the methylation state of the nucleic acid target gene region is methylated). A nucleic acid target gene region that does not contain any methylated nucleotides is considered unmethylated. Similarly, the methylation state of a nucleotide locus in a nucleic acid target gene region refers to the presence or absence of a methylated nucleotide at a particular locus in the nucleic acid target gene region. For example, the methylation state of a cytosine at the 7th nucleotide in a nucleic acid target gene region is methylated when the nucleotide present at the 7th nucleotide in the nucleic acid target gene region is 5-methylcytosine. Similarly, the methylation state of a cytosine at the 7th nucleotide in a nucleic acid target gene region is unmethylated when the nucleotide present at the 7th nucleotide in the nucleic acid target gene region is cytosine (and not 5-methylcytosine). Correspondingly, the ratio of methylated cytosine to unmethylated cytosine for a methylation site or sites can provide a methylation state of a nucleic acid target region.

As used herein, a “characteristic methylation state” refers to a unique or specific data set comprising the location of at least one, a portion of the total, or all of the methylation sites of a nucleic acid, a nucleic acid target gene region or a gene of a sample obtained from an organism, a tissue or a cell.

As used herein, “methylation ratio” refers to the number of instances in which a molecule or locus is methylated relative to the sum of methylated and unmethylated sites (i.e., in the entire sample). Methylation ratio can be used to describe a population of individuals or a sample from a single individual. For example, a methylation ratio at a single locus can be used to compare different nucleic acid molecules derived from one or more samples (e.g., cells, tissues and the like) of a single person, a specific person and at least one other person, or among a group of people. Methylation ratios can be used, for example, to describe the degree to which a nucleotide locus or nucleic acid region is methylated in a population of individuals. Thus, when methylation in a first population or pool of nucleic acid molecules is different from methylation in a second population or pool of nucleic acid molecules, the methylation ratio of the first population or pool will be different from the methylation ratio of the second population or pool. Such a ratio also can be used, for example, to describe the degree to which a nucleotide locus or nucleic acid region is methylated in a single individual. For example, such a ratio can be used to describe the degree to which a nucleic acid target gene region of a group of cells from a tissue sample is methylated or unmethylated at a nucleotide locus or methylation site.
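As a worked illustration of this definition, the ratio for a locus is the methylated count divided by the sum of methylated and unmethylated counts, and two pools can be compared by the difference of their ratios. The function names and the example counts below are hypothetical and chosen only to show the arithmetic.

```python
def methylation_ratio(methylated, unmethylated):
    """Methylated observations relative to all observations at a locus."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no observations at this locus")
    return methylated / total


def ratio_difference(pool_a, pool_b):
    """Difference in methylation ratio between two pools at the same locus.

    Each pool is a (methylated, unmethylated) pair of counts.
    """
    return methylation_ratio(*pool_a) - methylation_ratio(*pool_b)


if __name__ == "__main__":
    first_pool = (30, 10)    # hypothetical counts: 30 methylated, 10 unmethylated
    second_pool = (5, 15)    # hypothetical counts: 5 methylated, 15 unmethylated
    print(methylation_ratio(*first_pool))
    print(methylation_ratio(*second_pool))
    print(ratio_difference(first_pool, second_pool))
```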

As used herein, a “methylated nucleotide” or a “methylated nucleotide base” refers to the presence of a methyl moiety on a nucleotide base, where the methyl moiety is not present in a recognized typical nucleotide base. For example, cytosine does not contain a methyl moiety on its pyrimidine ring, but 5-methylcytosine contains a methyl moiety at position 5 of its pyrimidine ring. Therefore, cytosine is not a methylated nucleotide and 5-methylcytosine is a methylated nucleotide. In another example, thymine contains a methyl moiety at position 5 of its pyrimidine ring; however, for purposes herein, thymine is not considered a methylated nucleotide when present in DNA since thymine is a typical nucleotide base of DNA. Typical nucleoside bases for DNA are thymine, adenine, cytosine and guanine. Typical bases for RNA are uracil, adenine, cytosine and guanine. Correspondingly, a “methylation site” is the location in the target gene nucleic acid region where methylation has occurred, or has the possibility of occurring. For example, a location containing CpG is a methylation site wherein the cytosine may or may not be methylated.

As used herein, a “methylation site” refers to a nucleotide within a nucleic acid, nucleic acid target gene region or gene that is susceptible to methylation either by naturally occurring events in vivo or by an event instituted to chemically methylate the nucleotide in vitro.

As used herein, the term “methylation sensitive enzyme” refers to an enzyme that cleaves in a methylation-dependent manner, i.e., the enzyme either preferentially cleaves methylated recognition sites or preferentially cleaves unmethylated recognition sites. An example of an enzyme that cleaves a methylated recognition site is BisI. Examples of enzymes that preferentially cleave unmethylated recognition sites include, but are not limited to, AatII, AciI, AclI, AgeI, AscI, AsiSI, AvaI, BceAI, BmgBI, BsaAI, BsaHI, BsiEI, BsiWI, BsmBI, BspDI, BsrFI, BssHII, BstBI, BstUI, BtgZI, EagI, FauI, FseI, FspI, HaeII, HgaI, HhaI, HinP1I, HpaII, Hpy99I, HpyCH4IV, MluI, NaeI, NarI, NgoMIV, NotI, NruI, PaeR7I, PmlI, PvuI, RsrII, SacII, SalI, SfoI, SgrAI, SmaI, ZraI and the like.

The term “methylation-insensitive enzyme,” as used herein, refers to any enzyme that will cut a nucleic acid sequence at a CpG site with or without a 5-methylcytosine. In other words, a methylation-insensitive enzyme will cleave a methylation restriction site independent of its methylation status. For example, one methylation-insensitive enzyme is MspI.

In certain exemplary embodiments, the presence or absence of one or more methylated or unmethylated nucleotides may be identified as indicative of a disease state associated with methylated or unmethylated DNA, such as a neoplastic disease. In other embodiments, the presence or absence of one or more methylated or unmethylated nucleotides may be identified as indicative of a normal, healthy or disease free state. In still other embodiments, an abnormal ratio of methylated nucleic acid target gene molecules relative to unmethylated nucleic acid target gene molecules in a sample may be indicative of a disease state associated with methylated or unmethylated DNA, such as a neoplastic disease. For example, a relatively high number or a relatively low number of methylated nucleic acid target gene molecules compared to the relative amount in a normal individual may be indicative of a disease state associated with methylated or unmethylated DNA, such as a neoplastic disease. In other embodiments, an abnormal ratio of methylated nucleotide at a nucleotide locus relative to unmethylated nucleotide at a nucleotide locus in a nucleic acid target gene molecule can be indicative of a disease state associated with methylated or unmethylated DNA, such as a neoplastic disease. For example, a relatively high number or a relatively low number of methylated nucleotide loci compared to the relative amount in a normal individual can be indicative of a disease state associated with methylated or unmethylated DNA, such as a neoplastic disease.

Diseases associated with a modification of the methylation of one or more nucleotides include, for example: leukemia (Aoki E. et al., “Methylation status of the p15INK4B gene in hematopoietic progenitors and peripheral blood cells in myelodysplastic syndromes,” Leukemia 14(4):586-593 (2000); Nosaka, K. et al., “Increasing methylation of the CDKN2A gene is associated with the progression of adult T-cell leukemia” Cancer Res. 60(4):1043-1048 (2000); Asimakopoulos F A et al., “ABL1 methylation is a distinct molecular event associated with clonal evolution of chronic myeloid leukemia” Blood 94(7):2452-2460 (1999); Fajkusova L. et al., “Detailed Mapping of Methylcytosine Positions at the CpG Island Surrounding the Pa Promoter at the bcr-abl Locus in CML Patients and in Two Cell Lines, K562 and BV173” Blood Cells Mol. Dis. 26(3):193-204 (2000); Litz C. E. et al., “Methylation status of the major breakpoint cluster region in Philadelphia chromosome negative leukemias” Leukemia 6(1):35-41 (1992)), head and neck cancer (Sanchez-Cespedes M. et al. “Gene promoter hypermethylation in tumors and serum of head and neck cancer patients” Cancer Res. 60(4):892-895 (2000)), Hodgkin's disease (Garcia J. F. et al. “Loss of p16 protein expression associated with methylation of the p16INK4A gene is a frequent finding in Hodgkin's disease” Lab Invest. 79(12):1453-1459 (1999)), gastric cancer (Yanagisawa Y. et al., “Methylation of the hMLH1 promoter in familial gastric cancer with microsatellite instability” Int. J. Cancer 85(1):50-53 (2000)), prostate cancer (Rennie P. S. et al., “Epigenetic mechanisms for progression of prostate cancer” Cancer Metastasis Rev. 17(4):401-409 (1998-99)), renal cancer (Clifford, S. C. et al., “Inactivation of the von Hippel-Lindau (VHL) tumor suppressor gene and allelic losses at chromosome arm 3p in primary renal cell carcinoma: evidence for a VHL-independent pathway in clear cell renal tumourigenesis” Genes Chromosomes Cancer 22(3):200-209 (1998)), bladder cancer (Sardi, I. et al., “Molecular genetic alterations of c-myc oncogene in superficial and locally advanced bladder cancer” Eur. Urol. 33(4):424-430 (1998)), breast cancer (Mancini, D. N. et al., “CpG methylation within the 5′ regulatory region of the BRCA1 gene is tumor specific and includes a putative CREB binding site” Oncogene 16(9):1161-1169 (1998); Zrihan-Licht S. et al., “DNA methylation status of the MUC1 gene coding for a breast-cancer-associated protein” Int. J. Cancer 62(3):245-251 (1995); Kass, D. H. et al., “Examination of DNA methylation of chromosomal hot spots associated with breast cancer,” Anticancer Res. 13(5A):1245-1251 (1993)), Burkitt's lymphoma (Tao, Q. et al., “Epstein-Barr virus (EBV) in endemic Burkitt's lymphoma: molecular analysis of primary tumor tissue” Blood 91(4):1371-1381 (1998)), Wilms tumor (Kleymenova, E. V. et al., “Identification of a tumor-specific methylation site in the Wilms tumor suppressor gene” Oncogene 16(6):713-720 (1998)), Prader-Willi/Angelman syndrome (Zeschnigk et al. “Imprinted segments in the human genome: different DNA methylation patterns in the Prader-Willi/Angelman syndrome region as determined by the genomic sequencing method” Human Mol. Genetics 6(3):387-395 (1997); Fang P. et al., “The spectrum of mutations in UBE3A causing Angelman syndrome” Human Mol. Genetics 80:129-135 (1999)), ICF syndrome (Tuck-Muller C. M. et al., “DNA hypomethylation and unusual chromosome instability in cell lines from ICF syndrome patients” Cytogenet Cell Genet. 89(1-2):121-128 (2000)), dermatofibroma (Chen, T. C. et al., “Dermatofibroma is a clonal proliferative disease” J. Cutan Pathol 27(1):36-39 (2000)), hypertension (Lee, S. D. et al., “Monoclonal endothelial cell proliferation is present in primary but not secondary pulmonary hypertension” J. Clin. Invest. 101(5):927-934 (1998)), pediatric neurological disorders (Campos-Castello, J. et al., “The phenomenon of genomic ‘imprinting’ and its implications in clinical neuropediatrics” Rev. Neurol. 28(1):69-73 (1999)), autism (Klauck, S. M. et al., “Molecular genetic analysis of the FMR-1 gene in a large collection of autistic patients” Hum Genet 100(2):224-229 (1997)), ulcerative colitis (Gloria, L. et al., “DNA hypomethylation and proliferative activity are increased in the rectal mucosa of patients with long-standing ulcerative colitis” Cancer 78(11):2300-2306 (1996)), fragile X syndrome (Hornstra, I. K. et al., “High resolution methylation analysis of the FMR1 gene trinucleotide repeat region in fragile X syndrome” Human Mol. Genetics 2(10):1659-1665 (1993)), and Huntington's disease (Ferluga, J. et al., “Possible organ and age-related epigenetic factors in Huntington's disease and colorectal carcinoma” Med. Hypotheses 29(1):51-54 (1998)).

Additional diseases associated with the epigenetic state of DNA include, but are not limited to, low grade astrocytoma, anaplastic astrocytoma, glioblastoma, medulloblastoma, colon cancer, lung cancer, pancreatic cancer, endometrial cancer, neuroblastoma, headaches, sexual malfunction, primary myxedema, pernicious anemia, Addison's disease, myasthenia gravis, juvenile diabetes, idiopathic thrombocytopenic purpura, multiple sclerosis, rheumatoid arthritis, scleroderma, and other disorders such as CNS malfunctions, damage or disease; symptoms of aggression or behavioral disturbances; clinical, psychological and social consequences of brain damage; psychotic disturbances and personality disorders; dementia and/or associated syndromes; cardiovascular disease, malfunction and damage; malfunction, damage or disease of the gastrointestinal tract; malfunction, damage or disease of the respiratory system; lesion, inflammation, infection, immunity and/or convalescence; malfunction, damage or disease of the body as an abnormality in the developmental process; malfunction, damage or disease of the skin, the muscles, the connective tissue or the bones; and endocrine and metabolic malfunction, damage or disease. The epigenetic state of DNA also can be associated with undesired drug interactions.

Increased or decreased levels of methylation have been associated with a variety of diseases. Methylation or lack of methylation at defined positions can be associated with a disease or a disease-free state. The methods disclosed herein can be used with methods of determining the propensity of a subject to disease, diagnosing a disease, and determining a treatment regimen for a subject having a disease.

The methylation state of a variety of nucleotide loci and/or nucleic acid regions is known to be correlated with a disease, disease outcome, and success of treatment of a disease, and also may be used to distinguish disease types that are difficult to distinguish according to the symptoms, histologic samples or blood or serum samples. For example, CpG island methylator indicator phenotype (CIMP) is present in some types of ovarian carcinomas, but not in other types (Strathdee, et al., Am. J. Pathol. 158:1121-1127 (2001)). In another example, methylation may be used to distinguish between a carcinoid tumor and a pancreatic endocrine tumor, which may have different expected outcomes and disease treatment regimens (Chan et al., Oncogene 22:924-934 (2003)). In another example, H. pylori-dependent gastric mucosa associated lymphoid tissue (MALT) lymphomas are characterized as having several methylated nucleic acid regions, while those nucleic acid regions in H. pylori-independent MALT lymphomas are not methylated (Kaneko et al., Gut 52:641-646 (2003)). Similar relationships with disease, disease outcome and disease treatment have been correlated with hypomethylation or unmethylated nucleic acid regions or unmethylated nucleotide loci.

Methods related to the disease state of a subject may be performed by collecting a sample from a subject, treating the sample with a reagent that modifies a nucleic acid target sequence as a function of the methylation state of the nucleic acid target sequence, subjecting the sample to methylation specific amplification, then detecting one or more fragments that are associated with a disease or that are associated with a disease-free state. In certain embodiments, the fragments are detected by measuring the mass of the nucleic acid target gene molecule or nucleic acid target gene molecule fragments. Detection of a nucleic acid target gene sequence or nucleic acid target gene sequence fragment can identify the methylation state of a nucleic acid target gene molecule or the methylation state of one or more nucleotide loci of a nucleic acid target gene molecule. Identification of the methylation state of a nucleic acid target gene sequence or the methylation state of one or more nucleotide loci of a nucleic acid target gene sequence can indicate the propensity of the subject toward one or more diseases, the disease state of a subject, or an appropriate or inappropriate course of disease treatment or management for a subject.

There are many hybridization-based assays that comprise a hybridization step that forms a structure or complex with a target polynucleotide, such as a fragment of genomic DNA, and an enzymatic processing step in which one or more enzymes either recognize such structure or complex as a substrate or are prevented from recognizing a substrate because it is protected by such structure or complex. In particular, such assays are widely used in multiplexed formats to simultaneously analyze DNA samples at multiple loci, e.g. allele-specific multiplex PCR, arrayed primer extension (APEX) technology, solution phase primer extension or ligation assays, and the like, described in the following exemplary references: Syvanen, Nature Genetics Supplement, 37: S5-S10 (2005); Shumaker et al., Hum. Mut., 7: 346-354 (1996); Huang et al., U.S. Pat. Nos. 6,709,816 and 6,287,778; Fan et al., U.S. patent publication 2003/0003490; Gunderson et al., U.S. patent publication 2005/0037393; Hardenbol et al., Nature Biotechnology, 21: 673-678 (2003); Nilsson et al., Science, 265: 2085-2088 (1994); Baner et al., Nucleic Acids Research, 26: 5073-5078 (1998); Lizardi et al., Nat. Genet., 19: 225-232 (1998); Gerry et al., J. Mol. Biol., 292: 251-262 (1999); Fan et al., Genome Research, 10: 853-860 (2000); International patent publications WO 2002/57491 and WO 2000/58516; U.S. Pat. Nos. 6,506,594 and 4,883,750; and the like.

In one aspect, hybridization-based assays include circularizing probes, such as padlock probes, rolling circle probes, molecular inversion probes, linear amplification molecules for multiplexed PCR, and the like, e.g. padlock probes being disclosed in U.S. Pat. Nos. 5,871,921; 6,235,472; 5,866,337; and Japanese patent JP 4-262799; rolling circle probes being disclosed in Aono et al., JP 4-262799; Lizardi, U.S. Pat. Nos. 5,854,033; 6,183,960; 6,344,239; molecular inversion probes being disclosed in Hardenbol et al. (supra) and in Willis et al., U.S. Pat. No. 6,858,412; and linear amplification molecules being disclosed in Faham et al., U.S. patent publication 2003/0104459. Such probes are desirable because non-circularized probes can be digested with single stranded exonucleases, thereby greatly reducing background noise due to spurious amplifications, and the like. In the case of molecular inversion probes (MIPs), padlock probes, and rolling circle probes, constructs for generating labeled target sequences are formed by circularizing a linear version of the probe in a template-driven reaction on a target polynucleotide followed by digestion of non-circularized polynucleotides in the reaction mixture, such as target polynucleotides, unligated probe, probe concatemers, and the like, with an exonuclease, such as exonuclease I.

In certain exemplary embodiments, padlock probes are provided to profile the methylation state of a nucleic acid sample. As used herein, the term “padlock probe” includes, but is not limited to, an oligonucleotide sequence (e.g., about 70-140 nucleotides in length) that includes two regions of homology to a target nucleic acid sequence (e.g., genomic DNA) located at the termini or ends of the probe, two PCR primer regions, and two cleavage sites (See Hardenbol et al., Nature Biotech., Vol. 21, No. 6, June 2003; Hardenbol et al., Genome Research, 2005; 15(2):269-75; Fakhrai et al. (2003) Nature Biotech. 21(6):673; and Wang et al. (2005) Nucl. Acids Res. 33:e183). A padlock probe can be circularized by ligation in the presence of a correct target sequence. A universal detection tag sequence can be used for array detection of amplified probe. Cleavage sites are used to release the circularized probe from genomic DNA and for post-amplification processing.
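The probe layout described above can be pictured as a simple record. The following Python sketch is illustrative only: the class name, field names, the internal order of the primer regions and cleavage sites, and all sequences are hypothetical placeholders. Only the list of elements (two terminal homology regions, two PCR primer regions, two cleavage sites) and the approximate 70-140 nucleotide length come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PadlockProbe:
    left_arm: str         # target-homology region at one terminus
    primer_site_1: str    # first PCR primer region
    cleavage_site_1: str  # first cleavage site
    cleavage_site_2: str  # second cleavage site
    primer_site_2: str    # second PCR primer region
    right_arm: str        # target-homology region at the other terminus

    def sequence(self) -> str:
        """Linear probe sequence with the homology arms at the two ends."""
        return (self.left_arm + self.primer_site_1 + self.cleavage_site_1
                + self.cleavage_site_2 + self.primer_site_2 + self.right_arm)

    def length_in_range(self, low: int = 70, high: int = 140) -> bool:
        """Check the approximate length range given in the definition."""
        return low <= len(self.sequence()) <= high


# Placeholder sequences chosen only for their lengths (25/20/6/6/20/25 nt).
probe = PadlockProbe(
    left_arm="ACGTA" * 5,
    primer_site_1="GATTC" * 4,
    cleavage_site_1="AAGCTT",
    cleavage_site_2="CTGCAG",
    primer_site_2="CCATG" * 4,
    right_arm="TGCAT" * 5,
)
print(len(probe.sequence()), probe.length_in_range())
```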

Methods of conducting multiplexed hybridization-based assays using microarrays, and like platforms, suitable for the present invention are well known in the art. Guidance for selecting conditions and materials for applying labeled sequences to solid phase supports, such as microarrays, may be found in the literature, e.g. Wetmur, Crit. Rev. Biochem. Mol. Biol., 26: 227-259 (1991); DeRisi et al., Science, 278: 680-686 (1997); Chee et al., Science, 274: 610-614 (1996); Duggan et al., Nature Genetics, 21: 10-14 (1999); Schena, Editor, Microarrays: A Practical Approach (IRL Press, Washington, 2000); Freeman et al., Biotechniques, 29: 1042-1055 (2000); and like references. Methods and apparatus for carrying out repeated and controlled hybridization reactions have been described in U.S. Pat. Nos. 5,871,928, 5,874,219, 6,045,996, 6,386,749 and 6,391,623. Hybridization conditions typically include salt concentrations of less than about 1 M, more usually less than about 500 mM and less than about 200 mM. Hybridization temperatures can be as low as 5° C., but are typically greater than 22° C., more typically greater than about 30° C., and preferably in excess of about 37° C. Hybridizations are usually performed under stringent conditions, i.e., conditions under which a probe will stably hybridize to a perfectly complementary target sequence, but will not stably hybridize to sequences that have one or more mismatches. The stringency of hybridization conditions depends on several factors, such as probe sequence, probe length, temperature, salt concentration, concentration of organic solvents, such as formamide, and the like. How such factors are selected is usually a matter of design choice to one of ordinary skill in the art for any particular embodiment. Usually, stringent conditions are selected to be about 5° C. lower than the T_m for the specific sequence for a particular ionic strength and pH. Exemplary hybridization conditions include salt concentration of at least 0.01 M to about 1 M Na ion concentration (or other salts) at a pH 7.0 to 8.3 and a temperature of at least 25° C. Additional exemplary hybridization conditions include the following: 5×SSPE (750 mM NaCl, 50 mM sodium phosphate, 5 mM EDTA, pH 7.4).
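As a worked check of the buffer arithmetic, the 5×SSPE composition quoted above implies per-1× concentrations of 150 mM NaCl, 10 mM sodium phosphate and 1 mM EDTA. The Python sketch below scales those inferred values to other strengths. The names `SSPE_1X_MM` and `sspe` are hypothetical, and the sketch assumes simple linear scaling of each component.

```python
# Per-1x concentrations in mM, inferred by dividing the quoted 5x recipe by 5.
SSPE_1X_MM = {"NaCl": 150.0, "sodium phosphate": 10.0, "EDTA": 1.0}


def sspe(strength: float) -> dict:
    """Component concentrations (mM) of SSPE at the given strength (e.g. 5 for 5x)."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    return {component: conc * strength for component, conc in SSPE_1X_MM.items()}


for strength in (1, 5, 6):
    print(f"{strength}x SSPE:", sspe(strength))
```

Under this scaling, 6× corresponds to 900 mM (0.9 M) NaCl, 60 mM phosphate and 6 mM EDTA, and both 5× and 6× stay below the roughly 1 M salt ceiling mentioned for hybridization conditions.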

An exemplary hybridization procedure for applying labeled target sequence to a GENFLEX™ microarray (Affymetrix, Santa Clara, Calif.) is as follows: denature labeled target sequence at 95-100° C. for 10 minutes and snap cool on ice for 2-5 minutes. The microarray is pre-hybridized with 6×SSPE-T (0.9 M NaCl, 60 mM NaH₂PO₄, 6 mM EDTA (pH 7.4), 0.005% Triton X-100) + 0.5 mg/ml of BSA for a few minutes, then hybridized with 120 μL hybridization solution (as described below) at 42° C. for 2 hours on a rotisserie at 40 RPM. Hybridization Solution consists of 3 M TMACl (tetramethylammonium chloride), 50 mM MES (2-[N-morpholino]ethanesulfonic acid sodium salt) (pH 6.7), 0.01% of Triton X-100, 0.1 mg/ml of herring sperm DNA, optionally 50 pM of fluorescein-labeled control oligonucleotide, 0.5 mg/ml of BSA (Sigma) and labeled target sequences in a total reaction volume of about 120 μL. The microarray is rinsed twice with 1×SSPE-T for about 10 seconds at room temperature, then washed with 1×SSPE-T for 15-20 minutes at 40° C. on a rotisserie at 40 RPM. The microarray is then washed 10 times with 6×SSPE-T at 22° C. on a fluidic station (e.g. model FS400, Affymetrix, Santa Clara, Calif.). Further processing steps may be required depending on the nature of the label(s) employed, e.g. direct or indirect. Microarrays containing labeled target sequences may be scanned on a confocal scanner (such as available commercially from Affymetrix) with a resolution of 60-70 pixels per feature and filters and other settings as appropriate for the labels employed. GENECHIP® (Affymetrix) or similar software may be used to convert the image files into digitized files for further data analysis.

Embodiments of the present invention are directed to the use of hybridization-based assays with polony sequencing technology or synthetic genomic technology. Polony technology is described in U.S. Pat. Nos. 6,432,360, 6,485,944 and 6,511,803 and PCT/US05/06425. In general, the term “polony” refers to “polymerized colony.” Polony technology relates to the amplification of nucleic acids. In general, a pool of nucleic acids is provided, preferably in an array where the nucleic acids are immobilized to a support. The nucleic acids are randomly patterned on the support. The nucleic acids are then amplified in situ to produce colonies of polymerized nucleic acids. Polony amplification can also take place on beads where a nucleic acid is attached to a bead and then polymerized in situ.

Samples or specimens containing target polynucleotides, such as fragments of genomic DNA, may come from a wide variety of sources for use with the present invention, including, but not limited to, cell cultures, animal or plant tissues, patient biopsies, environmental samples, and the like. Samples are prepared for assays of the invention using conventional techniques, which typically depend on the source from which a sample or specimen is taken.

Prior to carrying out reactions on a sample, it will often be desirable to perform one or more sample preparation operations upon the sample. Typically, these sample preparation operations will include such manipulations as extraction of intracellular material, e.g., nucleic acids from whole cell samples, viruses and the like.

For those embodiments where whole cells, viruses or other tissue samples are being analyzed, it will typically be necessary to extract the nucleic acids from the cells or viruses, prior to continuing with the various sample preparation operations. Accordingly, following sample collection, nucleic acids may be liberated from the collected cells, viral coat, etc., into a crude extract, followed by additional treatments to prepare the sample for subsequent operations, e.g., denaturation of contaminating (DNA binding) proteins, purification, filtration, desalting, and the like. Liberation of nucleic acids from the sample cells or viruses, and denaturation of DNA binding proteins may generally be performed by chemical, physical, or electrolytic lysis methods. For example, chemical methods generally employ lysing agents to disrupt the cells and extract the nucleic acids from the cells, followed by treatment of the extract with chaotropic salts such as guanidinium isothiocyanate or urea to denature any contaminating and potentially interfering proteins. Generally, where chemical extraction and/or denaturation methods are used, the appropriate reagents may be incorporated within a sample preparation chamber, a separate accessible chamber, or may be externally introduced.

Following extraction, it will often be desirable to separate the nucleic acids from other elements of the crude extract, e.g., denatured proteins, cell membrane particles, salts, and the like. Removal of particulate matter is generally accomplished by filtration, flocculation or the like. A variety of filter types may be readily incorporated into the device. Further, where chemical denaturing methods are used, it may be desirable to desalt the sample prior to proceeding to the next step. Desalting of the sample and isolation of the nucleic acid may generally be carried out in a single step, e.g., by binding the nucleic acids to a solid phase and washing away the contaminating salts, by performing gel filtration chromatography on the sample, by passing salts through dialysis membranes, and the like. Suitable solid supports for nucleic acid binding include, e.g., diatomaceous earth, silica (i.e., glass wool), or the like. Suitable gel exclusion media, also well known in the art, may also be readily incorporated into the devices of the present invention, and are commercially available from, e.g., Pharmacia and Sigma Chemical.

In some applications, such as measuring target polynucleotides in rare cells from a patient's blood, an enrichment step may be carried out prior to conducting an assay, such as by immunomagnetic isolation, fluorescent cell sorting or other such technique. Such isolation or enrichment may be carried out using a variety of techniques and materials known in the art, as disclosed in the following representative references: Terstappen et al., U.S. Pat. No. 6,365,362; Terstappen et al., U.S. Pat. No. 5,646,001; Rohr et al., U.S. Pat. No. 5,998,224; Kausch et al., U.S. Pat. No. 5,665,582; Kresse et al., U.S. Pat. No. 6,048,515; Kausch et al., U.S. Pat. No. 5,508,164; Miltenyi et al., U.S. Pat. No. 5,691,208; Molday, U.S. Pat. No. 4,452,773; Kronick, U.S. Pat. No. 4,375,407; Radbruch et al., Chapter 23, in Methods in Cell Biology, Vol. 42 (Academic Press, New York, 1994); Uhlen et al., Advances in Biomagnetic Separation (Eaton Publishing, Natick, 1994); Safarik et al., J. Chromatography B, 722: 33-53 (1999); Miltenyi et al., Cytometry, 11: 231-238 (1990); Nakamura et al., Biotechnol. Prog., 17: 1145-1155 (2001); Moreno et al., Urology, 58: 386-392 (2001); Racila et al., Proc. Natl. Acad. Sci., 95: 4589-4594 (1998); Zigeuner et al., J. Urology, 169: 701-705 (2003); Ghossein et al., Seminars in Surgical Oncology, 20: 304-311 (2001).

In one aspect, genomic DNA for analysis is obtained using standard commercially available DNA extraction kits, e.g., PUREGENE® DNA Isolation Kit (Gentra Systems, Minneapolis, Minn.). In another aspect, for assaying human genomic DNA with a multiplex hybridization-based assay containing from about 1000 to 50,000 probes, a DNA sample may be used having an amount within the range of from about 200 ng to about 1 microgram. When sample material is scarce, prior to assaying, sample DNA may be amplified by whole genome amplification, or like technique, to increase the total amount of DNA available for assaying. Several whole genome, or partial genome, amplification techniques are known in the art, such as the following: Telenius et al. (1992) Genomics 13:718; Cheung et al. (1996) Proc. Natl. Acad. Sci. U.S.A. 93:14676; Dean et al. (2001) Genome Research 11:1095; U.S. Pat. Nos. 6,124,120; 6,280,949; 6,617,137; and the like.

Terms and symbols of nucleic acid chemistry, biochemistry, genetics, and molecular biology used herein follow those of standard treatises and texts in the field, e.g., Kornberg and Baker, DNA Replication, Second Edition (W.H. Freeman, New York, 1992); Lehninger, Biochemistry, Second Edition (Worth Publishers, New York, 1975); Strachan and Read, Human Molecular Genetics, Second Edition (Wiley-Liss, New York, 1999); Eckstein, editor, Oligonucleotides and Analogs: A Practical Approach (Oxford University Press, New York, 1991); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press, Oxford, 1984); and the like.

“Addressable” or “addressed” in reference to tag complements means that the nucleotide sequence, or perhaps other physical or chemical characteristics, of a tag complement can be determined from its address, i.e., a one-to-one correspondence between the sequence or other property of the tag complement and a spatial location on, or characteristic of, the solid phase support to which it is attached. In certain aspects, an address of a tag complement is a spatial location, e.g., the planar coordinates of a particular region containing copies of the tag complement. In other embodiments, probes may be addressed in other ways, e.g., by microparticle size, shape, color, color ratio or fluorescent ratio, radio frequency of micro-transponder, or the like, e.g., Kettman et al. (1998) Cytometry 33:234; Xu et al. (2003) Nucl. Acids Res. 31:e43; Bruchez Jr. et al., U.S. Pat. No. 6,500,622; Mandecki, U.S. Pat. No. 6,376,187; Stuelpnagel et al., U.S. Pat. No. 6,396,995; Chee et al., U.S. Pat. No. 6,544,732; Chandler et al., PCT publication WO 97/14028; and the like. According to the present invention, such terms also may refer to a nucleotide sequence that specifically identifies DNA or RNA sequences as having been captured from a given patient or other subject.
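The one-to-one correspondence in this definition can be sketched as a pair of lookup tables, one from address to tag complement and one from tag complement back to address. The Python below is a minimal illustration assuming planar (row, column) addresses. The function name `build_address_map` and the tag sequences are hypothetical.

```python
def build_address_map(tag_complements):
    """Build address->sequence and sequence->address tables, rejecting any
    layout in which an address or a tag complement appears more than once."""
    by_address = {}
    by_sequence = {}
    for address, sequence in tag_complements:
        if address in by_address or sequence in by_sequence:
            raise ValueError("address and tag complement must correspond one-to-one")
        by_address[address] = sequence
        by_sequence[sequence] = address
    return by_address, by_sequence


# Hypothetical array layout: (row, column) -> tag complement sequence.
layout = [((0, 0), "ACGTTGCA"), ((0, 1), "TTGACCGA"), ((1, 0), "GGCATCAT")]
by_address, by_sequence = build_address_map(layout)
print(by_address[(0, 1)])       # sequence determined from its address
print(by_sequence["GGCATCAT"])  # address determined from the sequence
```

Other addressing schemes named in the definition, such as microparticle color or size, would fit the same structure with a different key type in place of the coordinate pair.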

“Amplicon” means the product of a polynucleotide amplification reaction. That is, it is a population of polynucleotides, usually double stranded, that are replicated from one or more starting sequences. The one or more starting sequences may be one or more copies of the same sequence, or may be a mixture of different sequences. Amplicons may be produced by a variety of amplification reactions whose products are multiple replicates of one or more target nucleic acids. Generally, amplification reactions producing amplicons are “template-driven” in that the reactants, either nucleotides or oligonucleotides, must base pair with complements in a template polynucleotide for reaction products to be created. In one aspect, template-driven reactions are primer extensions with a nucleic acid polymerase or oligonucleotide ligations with a nucleic acid ligase. Such reactions include, but are not limited to, polymerase chain reaction (PCR), linear polymerase reactions, nucleic acid sequence-based amplification (NASBA), rolling circle amplifications, and the like, disclosed in the following references: Mullis et al., U.S. Pat. Nos. 4,683,195; 4,965,188; 4,683,202; 4,800,159 (PCR); Gelfand et al., U.S. Pat. No. 5,210,015 (real-time PCR with “Taqman” probes); Wittwer et al., U.S. Pat. No. 6,174,670; Kacian et al., U.S. Pat. No. 5,399,491 (“NASBA”); Lizardi, U.S. Pat. No. 5,854,033; Aono et al., Japanese Patent Pub. JP 4-262799 (rolling circle amplification); and the like. In one aspect, amplicons of the invention are produced by PCR. An amplification reaction may be a “real-time” amplification if a detection chemistry is available that permits a reaction product to be measured as the amplification reaction progresses, e.g. “real-time PCR” described below, or “real-time NASBA” as described in Leone et al. (1998) Nucl. Acids Res. 26:2150, and like references. As used herein, the term “amplifying” means performing an amplification reaction. A “reaction mixture” means a solution containing all the necessary reactants for performing a reaction, which may include, but not be limited to, buffering agents to maintain pH at a selected level during a reaction, salts, co-factors, scavengers, and the like. Methods of “polony amplification” are also described in U.S. Pat. No. 6,432,360, U.S. Pat. No. 6,511,803 and U.S. Pat. No. 6,485,944.

“Complementary” or “substantially complementary” refers to the hybridization or base pairing or the formation of a duplex between nucleotides or nucleic acids, such as, for instance, between the two strands of a double stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single stranded nucleic acid. Complementary nucleotides are, generally, A and T (or A and U), or C and G. Two single-stranded RNA or DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 80% of the nucleotides of the other strand, usually at least about 90% to 95%, and more preferably from about 98 to 100%. Alternatively, substantial complementarity exists when an RNA or DNA strand will hybridize under selective hybridization conditions to its complement. Typically, selective hybridization will occur when there is at least about 65% complementarity over a stretch of at least 14 to 25 nucleotides, preferably at least about 75%, more preferably at least about 90% complementarity. See Kanehisa (1984) Nucl. Acids Res. 12:203. According to the present invention, useful MIP primer sequences hybridize to sequences that flank the nucleotide base or series of bases to be captured.
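The percentage criterion in this definition can be made concrete with a small Python sketch. It is a simplification under stated assumptions: both strands are DNA written 5′ to 3′, they have equal length, and they are compared in a single ungapped antiparallel alignment, whereas the definition above also allows optimal alignment with insertions or deletions. The function names and sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def percent_complementarity(strand_a: str, strand_b: str) -> float:
    """Percent of positions in strand_a that Watson-Crick pair with strand_b
    when the two 5'->3' strands are aligned antiparallel without gaps."""
    if not strand_a or len(strand_a) != len(strand_b):
        raise ValueError("strands must be non-empty and of equal length")
    partner = strand_b.upper()[::-1]  # read strand_b 3'->5' to face strand_a
    paired = sum(COMPLEMENT.get(a) == b for a, b in zip(strand_a.upper(), partner))
    return 100.0 * paired / len(strand_a)


def substantially_complementary(strand_a: str, strand_b: str,
                                threshold: float = 80.0) -> bool:
    """Apply the 'at least about 80%' criterion from the definition."""
    return percent_complementarity(strand_a, strand_b) >= threshold


target = "ACGTACGTAC"
print(percent_complementarity(target, "GTACGTACGT"))      # exact reverse complement
print(percent_complementarity(target, "GTACGTACGA"))      # one unpaired position
print(substantially_complementary(target, "CTACGAACGA"))  # three unpaired positions
```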

“Complex” means an assemblage or aggregate of molecules in direct or indirect contact with one another. In one aspect, “contact,” or more particularly, “direct contact,” in reference to a complex of molecules or in reference to specificity or specific binding, means two or more molecules are close enough so that attractive noncovalent interactions, such as van der Waals forces, hydrogen bonding, ionic and hydrophobic interactions, and the like, dominate the interaction of the molecules. In such an aspect, a complex of molecules is stable in that under assay conditions the complex is thermodynamically more favorable than a non-aggregated, or non-complexed, state of its component molecules. As used herein, “complex” refers to a duplex or triplex of polynucleotides or a stable aggregate of two or more proteins. In regard to the latter, a complex is formed by an antibody specifically binding to its corresponding antigen.

“Duplex” means that at least two oligonucleotides and/or polynucleotides that are fully or partially complementary undergo Watson-Crick type base pairing among all or most of their nucleotides so that a stable complex is formed. The terms “annealing” and “hybridization” are used interchangeably to mean the formation of a stable duplex. In one aspect, stable duplex means that a duplex structure is not destroyed by a stringent wash, e.g., conditions including temperature of about 5° C. less than the T_m of a strand of the duplex and low monovalent salt concentration, e.g., less than 0.2 M, or less than 0.1 M. “Perfectly matched” in reference to a duplex means that the polynucleotide or oligonucleotide strands making up the duplex form a double stranded structure with one another such that every nucleotide in each strand undergoes Watson-Crick base pairing with a nucleotide in the other strand. The term “duplex” comprehends the pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine bases, PNAs, and the like, that may be employed. A “mismatch” in a duplex between two oligonucleotides or polynucleotides means that a pair of nucleotides in the duplex fails to undergo Watson-Crick bonding.

“Genetic locus,” or “locus” in reference to a genome or target polynucleotide, means a contiguous subregion or segment of the genome or target polynucleotide. As used herein, genetic locus, or locus, may refer to the position of a nucleotide, a gene, or a portion of a gene in a genome, including mitochondrial DNA, or it may refer to any contiguous portion of genomic sequence whether or not it is within, or associated with, a gene. In one aspect, a genetic locus refers to any portion of genomic sequence, including mitochondrial DNA, from a single nucleotide to a segment of a few hundred nucleotides, e.g. 100-300, in length. Usually, a particular genetic locus may be identified by its nucleotide sequence, or the nucleotide sequence, or sequences, of one or both adjacent or flanking regions. In another aspect, a genetic locus refers to the expressed nucleic acid product of a gene, such as an RNA molecule or a cDNA copy thereof.

“Hybridization” refers to the process in which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide. The term “hybridization” may also refer to triple-stranded hybridization. The resulting (usually) double-stranded polynucleotide is a “hybrid” or “duplex.” “Hybridization conditions” will typically include salt concentrations of less than about 1 M, more usually less than about 500 mM and even more usually less than about 200 mM. Hybridization temperatures can be as low as 5° C., but are typically greater than 22° C., more typically greater than about 30° C., and often in excess of about 37° C. Hybridizations are usually performed under stringent conditions, i.e., conditions under which a probe will hybridize to its target subsequence. Stringent conditions are sequence-dependent and are different in different circumstances. Longer fragments may require higher hybridization temperatures for specific hybridization. As other factors may affect the stringency of hybridization, including base composition and length of the complementary strands, presence of organic solvents and extent of base mismatching, the combination of parameters is more important than the absolute measure of any one alone. Generally, stringent conditions are selected to be about 5° C. lower than the T_m for the specific sequence at a defined ionic strength and pH. Exemplary stringent conditions include salt concentration of at least 0.01 M to no more than 1 M Na ion concentration (or other salts) at a pH 7.0 to 8.3 and a temperature of at least 25° C. For example, conditions of 5×SSPE (750 mM NaCl, 50 mM Na phosphate, 5 mM EDTA, pH 7.4) and a temperature of 25-30° C. are suitable for allele-specific probe hybridizations. For stringent conditions, see for example, Sambrook, Fritsch and Maniatis, Molecular Cloning: A Laboratory Manual, 2nd Ed., Cold Spring Harbor Press (1989) and Anderson, Nucleic Acid Hybridization, 1st Ed., BIOS Scientific Publishers Limited (1999).
“Hybridizing specifically to” or “specifically hybridizing to” or like expressions refer to the binding, duplexing, or hybridizing of a molecule substantially to or only to a particular nucleotide sequence or sequences under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular DNA or RNA).
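The guideline that stringent conditions sit about 5° C. below the T_m can be worked through numerically. This disclosure does not state how T_m is to be calculated, so the Python sketch below substitutes the Wallace rule of thumb for short oligonucleotides (2° C. per A or T plus 4° C. per G or C), which is an assumption introduced here only for illustration. The function names and the probe sequence are hypothetical.

```python
def wallace_tm(sequence: str) -> float:
    """Rough melting temperature (deg C) of a short DNA oligonucleotide using
    the Wallace rule, Tm = 2*(A+T) + 4*(G+C). An approximation, not a method
    prescribed by the text."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if not seq or at + gc != len(seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    return 2.0 * at + 4.0 * gc


def stringent_temperature(sequence: str, offset: float = 5.0) -> float:
    """Temperature about `offset` deg C below the estimated Tm."""
    return wallace_tm(sequence) - offset


probe = "ACGTTGCAACGTTGCA"  # hypothetical 16-mer: 8 A/T and 8 G/C
print(wallace_tm(probe), stringent_temperature(probe))
```

Because T_m also depends on ionic strength, pH and organic solvents, as the definition notes, a sequence-only estimate of this kind is at best a starting point for choosing conditions.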

“Hybridization-based assay” means any assay that relies on the formation of a stable complex as the result of a specific binding event. In one aspect, a hybridization-based assay means any assay that relies on the formation of a stable duplex or triplex between a probe and a target nucleotide sequence for detecting or measuring such a sequence. In one aspect, probes of such assays anneal to (or form duplexes with) regions of target sequences in the range of from 8 to 100 nucleotides; or in other aspects, they anneal to target sequences in the range of from 8 to 40 nucleotides, or more usually, in the range of from 8 to 20 nucleotides. A “probe” in reference to a hybridization-based assay means a polynucleotide that has a sequence that is capable of forming a stable hybrid (or triplex) with its complement in a target nucleic acid and that is capable of being detected, either directly or indirectly. Hybridization-based assays include, without limitation, assays that use the specific base-pairing of one or more oligonucleotides as target recognition components, such as polymerase chain reactions, NASBA reactions, oligonucleotide ligation reactions, single-base extension reactions, circularizable probe reactions, allele-specific oligonucleotide hybridizations, either in solution phase or bound to solid phase supports, such as microarrays or microbeads, and the like. An important subset of hybridization-based assays includes such assays that have at least one enzymatic processing step after a hybridization step. Hybridization-based assays of this subset include, without limitation, polymerase chain reactions, NASBA reactions, oligonucleotide ligation reactions, cleavase reactions, e.g., in INVADER™ assays, single-base extension reactions, probe circularization reactions, and the like. There is extensive guidance in the literature on hybridization-based assays, e.g., Hames et al., editors, Nucleic Acid Hybridization: A Practical Approach (IRL Press, Oxford, 1985); Tijssen, Hybridization with Nucleic Acid Probes, Parts I & II (Elsevier Publishing Company, 1993); Hardiman, Microarray Methods and Applications (DNA Press, 2003); Schena, editor, DNA Microarrays: A Practical Approach (IRL Press, Oxford, 1999); and the like. In one aspect, hybridization-based assays are solution phase assays; that is, both probes and target sequences hybridize under conditions that are substantially free of surface effects or influences on reaction rate. A solution phase assay includes circumstances where either probes or target sequences are attached to microbeads such that the attached sequences have substantially the same environment (e.g., permitting reagent access, etc.) as free sequences. In another aspect, hybridization-based assays include immunoassays wherein antibodies employ nucleic acid reporters based on amplification. In such assays, antibody probes specifically bind to target molecules, such as proteins, in separate reactions, after which the products of such reactions (i.e., antibody-protein complexes) are combined and nucleic acid reporters are amplified. Preferably, such nucleic acid reporters include oligonucleotide tags that are converted enzymatically into labeled oligonucleotide tags for analysis on a microarray, as described below. The following exemplary references disclose antibody-nucleic acid conjugates for immunoassays: Baez et al., U.S. Pat. No. 6,511,809; Sano et al., U.S. Pat. No. 5,665,539; Eberwine et al., U.S. Pat. No. 5,922,553; Landegren et al., U.S. Pat. No. 6,558,928; Landegren et al., U.S. Patent Pub. 2002/0064779; and the like. In particular, the two latter patent publications by Landegren et al. disclose steps of forming amplifiable probes after a specific binding event.

“Kit” refers to any delivery system for delivering materials or reagents for carrying out a method of the invention. In the context of assays, such delivery systems include systems that allow for the storage, transport, or delivery of reaction reagents (e.g., probes, enzymes, etc. in the appropriate containers) and/or supporting materials (e.g., buffers, written instructions for performing the assay, etc.) from one location to another. For example, kits include one or more enclosures (e.g., boxes) containing the relevant reaction reagents and/or supporting materials for assays of the invention. In one aspect, kits of the invention comprise probes specific for polymorphic loci. In another aspect, kits comprise nucleic acid standards for validating the performance of probes specific for polymorphic loci. Such contents may be delivered to the intended recipient together or separately. For example, a first container may contain an enzyme for use in an assay, while a second container contains probes.

“Ligation” means to form a covalent bond or linkage between the termini of two or more nucleic acids, e.g., oligonucleotides and/or polynucleotides, in a template-driven reaction. The nature of the bond or linkage may vary widely and the ligation may be carried out enzymatically or chemically. As used herein, ligations are usually carried out enzymatically to form a phosphodiester linkage between a 5′ carbon of a terminal nucleotide of one oligonucleotide and a 3′ carbon of another oligonucleotide. A variety of template-driven ligation reactions are described in the following references: Whiteley et al., U.S. Pat. No. 4,883,750; Letsinger et al., U.S. Pat. No. 5,476,930; Fung et al., U.S. Pat. No. 5,593,826; Kool, U.S. Pat. No. 5,426,180; Landegren et al., U.S. Pat. No. 5,871,921; Xu and Kool (1999) Nucl. Acids Res. 27:875; Higgins et al., Meth. in Enzymol. (1979) 68:50; Engler et al. (1982) The Enzymes, 15:3; and Namsaraev, U.S. Patent Pub. 2004/0110213.

“Microarray” refers in one embodiment to a type of multiplex assay product that comprises a solid phase support having a substantially planar surface on which there is an array of spatially defined non-overlapping regions or sites that each contain an immobilized hybridization probe. “Substantially planar” means that features or objects of interest, such as probe sites, on a surface may occupy a volume that extends above or below a surface and whose dimensions are small relative to the dimensions of the surface. For example, beads disposed on the face of a fiber optic bundle create a substantially planar surface of probe sites, or oligonucleotides disposed or synthesized on a porous planar substrate create a substantially planar surface. A spatially defined site may additionally be “addressable” in that its location and the identity of the immobilized probe at that location are known or determinable. Probes immobilized on microarrays include nucleic acids, such as oligonucleotide barcodes, that are generated in or from an assay reaction. Typically, the oligonucleotides or polynucleotides on microarrays are single stranded and are covalently attached to the solid phase support, usually by a 5′-end or a 3′-end. The density of non-overlapping regions containing nucleic acids in a microarray is typically greater than 100 per cm², and more preferably, greater than 1000 per cm². Microarray technology relating to nucleic acid probes is reviewed in the following exemplary references: Schena, Editor, Microarrays: A Practical Approach (IRL Press, Oxford, 2000); Southern, Current Opin. Chem. Biol., 2: 404-410 (1998); Nature Genetics Supplement, 21:1-60 (1999); and Fodor et al., U.S. Pat. Nos. 5,424,186; 5,445,934; and 5,744,305. A microarray may comprise arrays of microbeads, or other microparticles, alone or disposed on a planar surface or in wells or other physical configurations that can be used to separate the beads. Such microarrays may be formed in a variety of ways, as disclosed in the following exemplary references: Brenner et al. (2000) Nat. Biotechnol. 18:630; Tulley et al., U.S. Pat. No. 6,133,043; Stuelpnagel et al., U.S. Pat. No. 6,396,995; Chee et al., U.S. Pat. No. 6,544,732; and the like. In one format, microarrays are formed by randomly disposing microbeads having attached oligonucleotides on a surface followed by determination of which microbead carries which oligonucleotide by a decoding procedure, e.g., as disclosed by Gunderson et al., U.S. Patent Pub. No. 2003/0096239.

“Microarrays” or “arrays” can also refer to a heterogeneous pool of nucleic acid molecules that is distributed over a support matrix. The nucleic acids can be covalently or noncovalently attached to the support. Preferably, the nucleic acid molecules are spaced at a distance from one another sufficient to permit the identification of discrete features of the array. Nucleic acids on the array may be non-overlapping or partially overlapping. Methods of transferring a nucleic acid pool to support media are described in U.S. Pat. No. 6,432,360. Bead-based methods useful in the present invention are disclosed in PCT US05/04373.

“Amplifying” includes the production of copies of a nucleic acid molecule of the array or a nucleic acid molecule bound to a bead via repeated rounds of primed enzymatic synthesis. “In situ” amplification indicates that the amplification takes place with the template nucleic acid molecule positioned on a support or a bead, rather than in solution. In situ amplification methods are described in U.S. Pat. No. 6,432,360.

“Support” can refer to a matrix upon which nucleic acid molecules of a nucleic acid array are placed. The support can be solid or semi-solid or a gel. “Semi-solid” refers to a compressible matrix with both a solid and a liquid component, wherein the liquid occupies pores, spaces or other interstices between the solid matrix elements. Semi-solid supports can be selected from polyacrylamide, cellulose, polyamide (nylon) and cross-linked agarose, dextran and polyethylene glycol.

“Randomly-patterned” or “random” refers to non-ordered, non-Cartesian distribution (in other words, not arranged at pre-determined points along the x- or y-axes of a grid or at defined “clock positions,” degrees or radii from the center of a radial pattern) of nucleic acid molecules over a support, that is not achieved through an intentional design (or program by which such design may be achieved) or by placement of individual nucleic acid features. Such a “randomly-patterned” or “random” array of nucleic acids may be achieved by dropping, spraying, plating or spreading a solution, emulsion, aerosol, vapor or dry preparation comprising a pool of nucleic acid molecules onto a support and allowing the nucleic acid molecules to settle onto the support without intervention in any manner to direct them to specific sites thereon. Arrays of the invention can be randomly patterned or random.

“Heterogeneous” refers to a population or collection of nucleic acid molecules that comprises a plurality of different sequences. According to one aspect, a heterogeneous pool of nucleic acid molecules results from a preparation of RNA or DNA from a cell which may be unfractionated or partially fractionated.

“Nucleoside” as used herein includes the natural nucleosides, including 2′-deoxy and 2′-hydroxyl forms, e.g., as described in Kornberg and Baker, DNA Replication, 2nd Ed. (Freeman, San Francisco, 1992). “Analogs” in reference to nucleosides includes synthetic nucleosides having modified base moieties and/or modified sugar moieties, e.g., described by Scheit, Nucleotide Analogs (John Wiley, New York, 1980); Uhlmann and Peyman, Chemical Reviews, 90:543-584 (1990), or the like, with the proviso that they are capable of specific hybridization. Such analogs include synthetic nucleosides designed to enhance binding properties, reduce complexity, increase specificity, and the like. Polynucleotides comprising analogs with enhanced hybridization or nuclease resistance properties are described in Uhlmann and Peyman (cited above); Crooke et al., Exp. Opin. Ther. Patents, 6: 855-870 (1996); Mesmaeker et al., Current Opinion in Structural Biology, 5:343-355 (1995); and the like. Exemplary types of polynucleotides that are capable of enhancing duplex stability include oligonucleotide phosphoramidates (referred to herein as “amidates”), peptide nucleic acids (referred to herein as “PNAs”), oligo-2′-O-alkylribonucleotides, polynucleotides containing C-5 propynylpyrimidines, locked nucleic acids (LNAs), and like compounds. Such oligonucleotides are either available commercially or may be synthesized using methods described in the literature.

“Oligonucleotide” or “polynucleotide,” which are used synonymously, means a linear polymer of natural or modified nucleosidic monomers linked by phosphodiester bonds or analogs thereof. The term “oligonucleotide” usually refers to a shorter polymer, e.g., comprising from about 3 to about 100 monomers, and the term “polynucleotide” usually refers to longer polymers, e.g., comprising from about 100 monomers to many thousands of monomers, e.g., 10,000 monomers, or more. Oligonucleotides comprising probes or primers usually have lengths in the range of from 12 to 60 nucleotides, and more usually, from 18 to 40 nucleotides. Oligonucleotides and polynucleotides may be natural or synthetic. Oligonucleotides and polynucleotides include deoxyribonucleosides, ribonucleosides, and non-natural analogs thereof, such as anomeric forms thereof, peptide nucleic acids (PNAs), and the like, provided that they are capable of specifically binding to a target genome by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like.

Usually nucleosidic monomers are linked by phosphodiester bonds. Whenever an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′ to 3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, “T” denotes deoxythymidine, and “U” denotes the ribonucleoside, uridine, unless otherwise noted. Usually oligonucleotides comprise the four natural deoxynucleotides; however, they may also comprise ribonucleosides or non-natural nucleotide analogs. It is clear to those skilled in the art when oligonucleotides having natural or non-natural nucleotides may be employed in methods and processes described herein. For example, where processing by an enzyme is called for, usually oligonucleotides consisting solely of natural nucleotides are required. Likewise, where an enzyme has specific oligonucleotide or polynucleotide substrate requirements for activity, e.g., single stranded DNA, RNA/DNA duplex, or the like, then selection of appropriate composition for the oligonucleotide or polynucleotide substrates is well within the knowledge of one of ordinary skill, especially with guidance from treatises, such as Sambrook et al., Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York, 1989), and like references. Oligonucleotides and polynucleotides may be single stranded or double stranded.
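As an illustration of the 5′ to 3′ notation convention described above, the following minimal Python sketch computes the complementary strand of a DNA sequence and reports it in the same 5′ to 3′ orientation. The function name and lookup table are illustrative assumptions and are not part of the described methods; only the four natural deoxynucleotide letters are handled.

```python
# Illustrative sketch only: Watson-Crick complements for the four natural
# deoxynucleotide letters. Names here are hypothetical, not from the source.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3' from left to right."""
    # Reverse the input so that the output also reads 5' to 3'.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGCCTG"))  # CAGGCAT
```

Because both strands are written 5′ to 3′, the complement of “ATGCCTG” reads “CAGGCAT” rather than “TACGGAC.”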

“Oligonucleotide tag” or “tag” means an oligonucleotide that is attached to a polynucleotide and is used to identify and/or track the polynucleotide in a reaction. Usually, an oligonucleotide tag is attached to the 3′- or 5′-end of a polynucleotide to form a linear conjugate, sometimes referred to herein as a “tagged polynucleotide,” or equivalently, an “oligonucleotide tag-polynucleotide conjugate,” or “tag-polynucleotide conjugate.” Oligonucleotide tags may vary widely in size and compositions; the following references provide guidance for selecting sets of oligonucleotide tags appropriate for particular embodiments: Brenner, U.S. Pat. No. 5,635,400; Brenner et al., Proc. Natl. Acad. Sci., 97: 1665; Shoemaker et al. (1996) Nature Genetics, 14:450; Morris et al., EP Patent Pub. 0799897A1; Wallace, U.S. Pat. No. 5,981,179; and the like. In different applications of the invention, oligonucleotide tags can each have a length within a range of from 4 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides, respectively. A tag that is useful in the present invention to identify samples captured from a specific patient or other source is of sufficient length and complexity to distinguish it from sequences that identify other patients or sources of DNA being assayed in parallel. In one aspect, oligonucleotide tags are used in sets, or repertoires, wherein each oligonucleotide tag of the set has a unique nucleotide sequence. In some embodiments, particularly where oligonucleotide tags are used to sort polynucleotides, or where they are identified by specific hybridization, each oligonucleotide tag of such a set has a melting temperature that is substantially the same as that of every other member of the same set. In such aspects, the melting temperatures of oligonucleotide tags within a set are within 10° C. of one another; in another embodiment, they are within 5° C. of one another; and in another embodiment, they are within 2° C. of one another. In another aspect, oligonucleotide tags are members of a minimally cross-hybridizing set. That is, the nucleotide sequence of each member of such a set is sufficiently different from that of every other member of the set that no member can form a stable duplex with the complement of any other member under stringent hybridization conditions. In one aspect, the nucleotide sequence of each member of a minimally cross-hybridizing set differs from those of every other member by at least two nucleotides. Such a set of oligonucleotide tags may have a size in the range of from two, three, four, five etc., up to ten and several tens to many thousands, or even millions, e.g., 50 to 1.6×10⁶. In another embodiment, such a size is in the range of from 200 to 40,000; or from 200 to 10,000.
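The two set-level criteria described above (members differing from one another by at least two nucleotides, and melting temperatures falling within a stated window) can be checked with a short Python sketch. This is a simplified illustration under stated assumptions: the function names and example tags are hypothetical, tags are assumed to be of equal length, and counting mismatched positions is only a sequence-level proxy that does not itself establish duplex stability under stringent conditions.

```python
from itertools import combinations
from typing import Sequence


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes tags of equal length")
    return sum(x != y for x, y in zip(a, b))


def differs_by_at_least(tags: Sequence[str], min_diff: int = 2) -> bool:
    """True if every pair of tags differs at min_diff or more positions."""
    return all(mismatches(a, b) >= min_diff for a, b in combinations(tags, 2))


def tm_within_window(tms_c: Sequence[float], window_c: float) -> bool:
    """True if all melting temperatures lie within window_c degrees C."""
    return max(tms_c) - min(tms_c) <= window_c


if __name__ == "__main__":
    tags = ["ACGTACGT", "ACGTTGCA", "TGCAACGT"]  # hypothetical example tags
    print(differs_by_at_least(tags, min_diff=2))      # True
    print(tm_within_window([58.0, 60.5, 61.0], 5.0))  # True
    print(tm_within_window([58.0, 60.5, 61.0], 2.0))  # False
```

The melting temperatures passed to `tm_within_window` are supplied by the caller; how they are estimated or measured is outside the scope of this sketch.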

In one embodiment, an amplifiable probe of the invention comprises at least one oligonucleotide tag that is replicated and labeled to produce a labeled oligonucleotide probe. Alternatively, where patient-specific tags are envisioned, the tag can be detected by stringent hybridization or alternatively sequenced along with the target sequence. In one embodiment, labeled oligonucleotide probes are hybridized to a microarray of tag complements for detection. In this embodiment, for each different locus of each different genome (e.g., from distinct patients, patient samples or other sources) there is a unique labeled oligonucleotide tag. That is, the pair consisting of (i) the nucleotide sequence of the oligonucleotide tag and (ii) a label that generates detectable signal is uniquely associated with a particular locus of a particular genome. The nature of the label on an oligonucleotide tag can be based on a wide variety of physical or chemical properties including, but not limited to, light absorption, fluorescence, chemiluminescence, electrochemiluminescence, mass, charge, and the like. The signals based on such properties can be generated directly or indirectly. For example, a label can be a fluorescent molecule covalently attached to an amplified oligonucleotide tag that directly generates an optical signal. Alternatively, a label can comprise multiple components, such as a hapten-antibody complex, that, in turn, may include fluorescent dyes that generate optical signals, enzymes that generate products that produce optical signals, or the like. In certain aspects, the label on an oligonucleotide tag is a fluorescent label that is directly or indirectly attached to an amplified oligonucleotide tag. In one aspect, such fluorescent label is a fluorescent dye or quantum dot selected from a group consisting of from 2 to 6 spectrally resolvable fluorescent dyes or quantum dots. In a different embodiment, a set of samples could be queried serially, i.e., using one tag at a time, with each of the tags that represent different patients, samples, etc., wherein each tag is labeled with the same label, and what is detected is binding or no binding to members of the set of samples, thereby identifying in each round a given patient's sample.

Fluorescent labels and their attachment to oligonucleotides, such as oligonucleotide tags, are described in many reviews, including Haugland, Handbook of Fluorescent Probes and Research Chemicals, Ninth Edition (Molecular Probes, Inc., Eugene, 2002); Keller and Manak, DNA Probes, 2nd Edition (Stockton Press, New York, 1993); Eckstein, editor, Oligonucleotides and Analogues: A Practical Approach (IRL Press, Oxford, 1991); Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26:227-259 (1991); and the like. Particular methodologies applicable to the invention are disclosed in the following sample of references: Fung et al., U.S. Pat. No. 4,757,141; Hobbs, Jr., et al. U.S. Pat. No. 5,151,507; Cruickshank, U.S. Pat. No. 5,091,519. In one aspect, one or more fluorescent dyes are used as labels for labeled target sequences, e.g., as disclosed by Menchen et al., U.S. Pat. No. 5,188,934 (4,7-dichlorofluorescein dyes); Bergot et al., U.S. Pat. No. 5,366,860 (spectrally resolvable rhodamine dyes); Lee et al., U.S. Pat. No. 5,847,162 (4,7-dichlororhodamine dyes); Khanna et al., U.S. Pat. No. 4,318,846 (ether-substituted fluorescein dyes); Lee et al., U.S. Pat. No. 5,800,996 (energy transfer dyes); Lee et al., U.S. Pat. No. 5,066,580 (xanthene dyes); Mathies et al., U.S. Pat. No. 5,688,648 (energy transfer dyes); and the like. Labelling can also be carried out with quantum dots, as disclosed in the following patents and patent publications: U.S. Pat. Nos. 6,322,901; 6,576,291; 6,423,551; 6,251,303; 6,319,426; 6,426,513; 6,444,143; 5,990,479; 6,207,392; 2002/0045045; 2003/0017264; and the like. As used herein, the term “fluorescent label” includes a signaling moiety that conveys information through the fluorescent absorption and/or emission properties of one or more molecules. Such fluorescent properties include fluorescence intensity, fluorescence lifetime, emission spectrum characteristics, energy transfer, and the like.

Commercially available fluorescent nucleotide analogues readily incorporated into the labeling oligonucleotides include, for example, Cy3-dCTP, Cy3-dUTP, Cy5-dCTP, Cy5-dUTP (Amersham Biosciences, Piscataway, N.J.), fluorescein-12-dUTP, tetramethylrhodamine-6-dUTP, TEXAS RED™-5-dUTP, CASCADE BLUE™-7-dUTP, BODIPY™ FL-14-dUTP, BODIPY TMR-14-dUTP, BODIPY™ TR-14-dUTP, RHODAMINE GREEN™-5-dUTP, OREGON GREEN™ 488-5-dUTP, TEXAS RED™-12-dUTP, BODIPY™ 630/650-14-dUTP, BODIPY™ 650/665-14-dUTP, ALEXA FLUOR™ 488-5-dUTP, ALEXA FLUOR™ 532-5-dUTP, ALEXA FLUOR™ 568-5-dUTP, ALEXA FLUOR™ 594-5-dUTP, ALEXA FLUOR™ 546-14-dUTP, fluorescein-12-UTP, tetramethylrhodamine-6-UTP, TEXAS RED™-5-UTP, mCherry, CASCADE BLUE™-7-UTP, BODIPY™ FL-14-UTP, BODIPY TMR-14-UTP, BODIPY™ TR-14-UTP, RHODAMINE GREEN™-5-UTP, ALEXA FLUOR™ 488-5-UTP, ALEXA FLUOR™ 546-14-UTP (Molecular Probes, Inc., Eugene, Oreg.). Protocols are available for custom synthesis of nucleotides having other fluorophores (Henegariu et al., “Custom Fluorescent-Nucleotide Synthesis as an Alternative Method for Nucleic Acid Labeling,” Nature Biotechnol. 18:345-348 (2000)).

Other fluorophores available for post-synthetic attachment include, inter alia, ALEXA FLUOR™ 350, ALEXA FLUOR™ 532, ALEXA FLUOR™ 546, ALEXA FLUOR™ 568, ALEXA FLUOR™ 594, ALEXA FLUOR™ 647, BODIPY 493/503, BODIPY FL, BODIPY R6G, BODIPY 530/550, BODIPY TMR, BODIPY 558/568, BODIPY 564/570, BODIPY 576/589, BODIPY 581/591, BODIPY 630/650, BODIPY 650/665, Cascade Blue, Cascade Yellow, Dansyl, lissamine rhodamine B, Marina Blue, Oregon Green 488, Oregon Green 514, Pacific Blue, rhodamine 6G, rhodamine green, rhodamine red, tetramethylrhodamine, Texas Red (available from Molecular Probes, Inc., Eugene, Oreg.), and Cy2, Cy3.5, Cy5.5, and Cy7 (Amersham Biosciences, Piscataway, N.J. USA, and others).

FRET tandem fluorophores may also be used, such as PerCP-Cy5.5, PE-Cy5, PE-Cy5.5, PE-Cy7, PE-Texas Red, and APC-Cy7; also, PE-Alexa dyes (610, 647, 680) and APC-Alexa dyes.

Metallic silver particles may be coated onto the surface of the array to enhance signal from fluorescently labeled oligos bound to the array (Lakowicz et al. (2003) BioTechniques 34:62).

Biotin, or a derivative thereof, may also be used as a label on a detection oligonucleotide, and subsequently bound by a detectably labeled avidin/streptavidin derivative (e.g., phycoerythrin-conjugated streptavidin), or a detectably labeled anti-biotin antibody. Digoxigenin may be incorporated as a label and subsequently bound by a detectably labeled anti-digoxigenin antibody (e.g., fluoresceinated anti-digoxigenin). An aminoallyl-dUTP residue may be incorporated into a detection oligonucleotide and subsequently coupled to an N-hydroxysuccinimide (NHS) derivatized fluorescent dye, such as those listed supra. In general, any member of a conjugate pair may be incorporated into a detection oligonucleotide provided that a detectably labeled conjugate partner can be bound to permit detection. As used herein, the term antibody refers to an antibody molecule of any class, or any sub-fragment thereof, such as an Fab.

Other suitable labels for detection oligonucleotides may include fluorescein (FAM), digoxigenin, dinitrophenol (DNP), dansyl, biotin, bromodeoxyuridine (BrdU), hexahistidine (6× His), phospho-amino acids (e.g., P-tyr, P-ser, P-thr), or any other suitable label. In one embodiment the following hapten/antibody pairs are used for detection, in which each of the antibodies is derivatized with a detectable label: biotin/α-biotin, digoxigenin/α-digoxigenin, dinitrophenol (DNP)/α-DNP, 5-Carboxyfluorescein (FAM)/α-FAM.

As mentioned above, oligonucleotide tags can be indirectly labeled, especially with a hapten that is then bound by a capture agent, e.g., as disclosed in Holtke et al., U.S. Pat. Nos. 5,344,757; 5,702,888; and 5,354,657; Huber et al., U.S. Pat. No. 5,198,537; Miyoshi, U.S. Pat. No. 4,849,336; Misiura and Gait, PCT publication WO 91/17160; and the like. Many different hapten-capture agent pairs are available for use with the invention, either with a target sequence or with a detection oligonucleotide used with a target sequence, as described below. Exemplary haptens include biotin, des-biotin and other derivatives, dinitrophenol, dansyl, fluorescein, CY5, and other dyes, digoxigenin, and the like. For biotin, a capture agent may be avidin, streptavidin, or antibodies. Antibodies may be used as capture agents for the other haptens (many dye-antibody pairs being commercially available, e.g., Molecular Probes, Eugene, Oreg.).

“Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively). For example, in a conventional PCR using Taq DNA polymerase, a double stranded target nucleic acid may be denatured at a temperature greater than 90° C., primers annealed at a temperature in the range 50-75° C., and primers extended at a temperature in the range 72-78° C.
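The three-step cycle and the exemplary temperature ranges given above for a conventional PCR using Taq DNA polymerase can be expressed as a small Python sketch. The class and function names are illustrative assumptions. The copy-number function assumes idealized doubling of the target in every cycle, which is a simplification not stated in the text; real reactions generally amplify less efficiently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cycle:
    """One PCR cycle: temperatures in degrees C for the three steps."""
    denature_c: float  # step (i): denature the target nucleic acid
    anneal_c: float    # step (ii): anneal primers to primer binding sites
    extend_c: float    # step (iii): extend primers with a polymerase


def within_conventional_ranges(cycle: Cycle) -> bool:
    """Check a cycle against the exemplary Taq ranges quoted in the text."""
    return (
        cycle.denature_c > 90.0
        and 50.0 <= cycle.anneal_c <= 75.0
        and 72.0 <= cycle.extend_c <= 78.0
    )


def ideal_copies(start_copies: int, n_cycles: int) -> int:
    """Copies after n_cycles, ASSUMING perfect doubling every cycle."""
    return start_copies * 2 ** n_cycles


if __name__ == "__main__":
    program = Cycle(denature_c=95.0, anneal_c=60.0, extend_c=72.0)
    print(within_conventional_ranges(program))  # True
    print(ideal_copies(1, 30))                  # 1073741824
```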

The term “PCR” encompasses derivative forms of the reaction, including but not limited to, RT-PCR, real-time PCR, nested PCR, quantitative PCR, multiplexed PCR, and the like. Reaction volumes range from a few hundred nanoliters, e.g., 200 nL, to a few hundred microliters, e.g., 200 microliters. “Reverse transcription PCR,” or “RT-PCR,” means a PCR that is preceded by a reverse transcription reaction that converts a target RNA to a complementary single stranded DNA, which is then amplified, e.g., Tecott et al., U.S. Pat. No. 5,168,038. “Real-time PCR” means a PCR for which the amount of reaction product, i.e., amplicon, is monitored as the reaction proceeds. There are many forms of real-time PCR that differ mainly in the detection chemistries used for monitoring the reaction product, e.g., Gelfand et al., U.S. Pat. No. 5,210,015 (“Taqman”); Wittwer et al., U.S. Pat. Nos. 6,174,670 and 6,569,627 (intercalating dyes); Tyagi et al., U.S. Pat. No. 5,925,517 (molecular beacons). Detection chemistries for real-time PCR are reviewed in Mackay et al., Nucleic Acids Research, 30:1292-1305 (2002). “Nested PCR” means a two-stage PCR wherein the amplicon of a first PCR becomes the sample for a second PCR using a new set of primers, at least one of which binds to an interior location of the first amplicon. As used herein, “initial primers” in reference to a nested amplification reaction mean the primers used to generate a first amplicon, and “secondary primers” mean the one or more primers used to generate a second, or nested, amplicon. “Multiplexed PCR” means a PCR wherein multiple target sequences (or a single target sequence and one or more reference sequences) are simultaneously amplified in the same reaction mixture, e.g., Bernard et al. (1999) Anal. Biochem., 273:221-228 (two-color real-time PCR). Usually, distinct sets of primers are employed for each sequence being amplified. “Quantitative PCR” means a PCR designed to measure the abundance of one or more specific target sequences in a sample or specimen. Quantitative PCR includes both absolute quantitation and relative quantitation of such target sequences. Quantitative measurements are made using one or more reference sequences that may be assayed separately or together with a target sequence. The reference sequence may be endogenous or exogenous to a sample or specimen, and in the latter case, may comprise one or more competitor templates. Typical endogenous reference sequences include segments of transcripts of the following genes: β-actin, GAPDH, β₂-microglobulin, ribosomal RNA, and the like. Techniques for quantitative PCR are well-known to those of ordinary skill in the art, as exemplified in the following references: Freeman et al., Biotechniques, 26:112-126 (1999); Becker-Andre et al., Nucleic Acids Research, 17:9437-9447 (1989); Zimmerman et al., Biotechniques, 21:268-279 (1996); Diviacco et al., Gene, 122:3013-3020 (1992); Becker-Andre et al., Nucleic Acids Research, 17:9437-9446 (1989); and the like.

“Polymorphism” or “genetic variant” means a substitution, inversion, insertion, or deletion of one or more nucleotides at a genetic locus, or a translocation of DNA from one genetic locus to another genetic locus. In one aspect, polymorphism means one of multiple alternative nucleotide sequences that may be present at a genetic locus of an individual and that may comprise a nucleotide substitution, insertion, or deletion with respect to other sequences at the same locus in the same individual, or other individuals within a population. An individual may be homozygous or heterozygous at a genetic locus; that is, an individual may have the same nucleotide sequence in both alleles, or have a different nucleotide sequence in each allele, respectively. In one aspect, insertions or deletions at a genetic locus comprise the addition or the absence of from 1 to 10 nucleotides at such locus, in comparison with the same locus in another individual of a population (or another allele in the same individual). Usually, insertions or deletions are with respect to a major allele at a locus within a population, e.g., an allele present in a population at a frequency of fifty percent or greater.

“Primer” includes an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers usually have a length in the range of from 3 to 36 nucleotides, or from 5 to 24 nucleotides, or from 14 to 36 nucleotides. Primers within the scope of the invention can be universal primers or non-universal primers. Pairs of primers can flank a sequence of interest or a set of sequences of interest. Primers and probes can be degenerate in sequence. Primers within the scope of the present invention bind adjacent to the target sequence, whether it is the sequence to be captured for analysis, or a tag that is to be copied.

“Solid support,” “support,” and “solid phase support” are used interchangeably and refer to a material or group of materials having a rigid or semi-rigid surface or surfaces. In many embodiments, at least one surface of the solid support will be substantially flat, although in some embodiments it may be desirable to physically separate synthesis regions for different compounds with, for example, wells, raised regions, pins, etched trenches, or the like. According to other embodiments, the solid support(s) will take the form of beads, resins, gels, microspheres, or other geometric configurations. Microarrays usually comprise at least one planar solid phase support, such as a glass microscope slide. Semisolid supports and gel supports are also useful in the present invention, especially when polony amplification is used.

“Specific” or “specificity” in reference to the binding of one molecule to another molecule, such as a target sequence to a probe, means the recognition, contact, and formation of a stable complex between the two molecules, together with substantially less recognition, contact, or complex formation of that molecule with other molecules. In one aspect, “specific” in reference to the binding of a first molecule to a second molecule means that to the extent the first molecule recognizes and forms a complex with other molecules in a reaction or sample, it forms the largest number of the complexes with the second molecule. Preferably, this largest number is at least fifty percent. Generally, molecules involved in a specific binding event have areas on their surfaces or in cavities giving rise to specific recognition between the molecules binding to each other. Examples of specific binding include antibody-antigen interactions, enzyme-substrate interactions, formation of duplexes or triplexes among polynucleotides and/or oligonucleotides, receptor-ligand interactions, and the like. As used herein, “contact” in reference to specificity or specific binding means two molecules are close enough that weak non-covalent chemical interactions, such as van der Waals forces, hydrogen bonding, base-stacking interactions, ionic and hydrophobic interactions, and the like, dominate the interaction of the molecules.

“Spectrally resolvable” in reference to a plurality of fluorescent labels means that the fluorescent emission bands of the labels are sufficiently distinct, i.e., sufficiently non-overlapping, that molecular tags to which the respective labels are attached can be distinguished on the basis of the fluorescent signal generated by the respective labels by standard photodetection systems, e.g., employing a system of band pass filters and photomultiplier tubes, or the like, as exemplified by the systems described in U.S. Pat. Nos. 4,230,558; 4,811,218, or the like, or in Wheeless et al., pgs. 21-76, in Flow Cytometry: Instrumentation and Data Analysis (Academic Press, New York, 1985). In one aspect, for organic dyes, such as fluorescein, rhodamine, and the like, spectrally resolvable means that wavelength emission maxima are spaced at least 20 nm apart, and in another aspect, at least 40 nm apart. In another aspect, for chelated lanthanide compounds, quantum dots, and the like, spectrally resolvable means that wavelength emission maxima are spaced at least 10 nm apart, and in a further aspect, at least 15 nm apart.

“T_(m)” is used in reference to “melting temperature.” Melting temperature is the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. Several equations for calculating the T_(m) of nucleic acids are well known in the art. As indicated by standard references, a simple estimate of the T_(m) value may be calculated by the equation T_(m)=81.5+0.41 (% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl (see e.g., Anderson and Young, “Quantitative Filter Hybridization,” in Nucleic Acid Hybridization (1985)). Other references (e.g., Allawi, H. T. & SantaLucia, J., Jr., Biochemistry 36, 10581-94 (1997)) include alternative methods of computation which take structural and environmental, as well as sequence, characteristics into account for the calculation of T_(m).
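As an illustration only, the simple estimate T_(m)=81.5+0.41 (% G+C) can be sketched in a few lines of Python. The function name simple_tm is hypothetical and is not part of the disclosure. The sketch applies only the stated formula (1 M NaCl) and does not implement the alternative computations referenced above.

```python
def simple_tm(seq: str) -> float:
    """Estimate T_m (deg C) as 81.5 + 0.41 * (%G+C), per the simple formula.

    Illustrative sketch only: assumes an unambiguous A/C/G/T sequence and
    the 1 M NaCl condition stated for the formula.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = sum(base in "GC" for base in seq)
    percent_gc = 100.0 * gc_count / len(seq)
    return 81.5 + 0.41 * percent_gc
```

For example, a sequence with 50% G+C content gives an estimate of 81.5 + 0.41 × 50 = 102.0 under this formula.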

“Sample” means a quantity of material from a biological, environmental,medical, or patient source in which detection or measurement of targetnucleic acids is sought. On the one hand it is meant to include aspecimen or culture (e.g., microbiological cultures). On the other hand,it is meant to include both biological and environmental samples. Asample may include a specimen of synthetic origin. Biological samplesmay be animal, including human, fluid, solid (e.g., stool or tissue), aswell as liquid and solid food and feed products and ingredients such asdairy items, vegetables, meat and meat by-products, and waste.Biological samples may include materials taken from a patient including,but not limited to cultures, cells, tissues, blood, saliva, cerebralspinal fluid, pleural fluid, milk, lymph, sputum, semen, needleaspirates, and the like. Biological samples may be obtained from all ofthe various families of domestic animals, as well as feral or wildanimals, including, but not limited to, such animals as ungulates, bear,fish, rodents, etc. Environmental samples include environmental materialsuch as surface matter, soil, water and industrial samples, as well assamples obtained from food and dairy processing instruments, apparatus,equipment, utensils, disposable and non-disposable items. These examplesare not to be construed as limiting the sample types applicable to thepresent invention.

It is to be understood that the embodiments of the present invention which have been described are merely illustrative of some of the applications of the principles of the present invention. Numerous modifications may be made by those skilled in the art based upon the teachings presented herein without departing from the true spirit and scope of the invention. The contents of all references, patents and published patent applications cited throughout this application are hereby incorporated by reference in their entirety for all purposes.

The following examples are set forth as being representative of the present invention. These examples are not to be construed as limiting the scope of the invention as these and other equivalent embodiments will be apparent in view of the present disclosure, figures, tables, and accompanying claims.

Example I Bisulfite Padlock Probes (BSPPs)

Bisulfite padlock probe (BSPP) technology is a targeted method that isolates selected locations for methylation profiling. In this example, a “padlock probe” refers to an approximately 100 nucleotide DNA fragment that was designed to hybridize to genomic DNA targets in a horseshoe manner (FIG. 1A) (Nilsson, M. et al., Padlock probes: circularizing oligonucleotides for localized DNA detection. Science 265 (5181), 2085-2088 (1994); Hardenbol, P. et al., Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21 (6), 673-678 (2003); Porreca et al., Multiplex amplification of large sets of human exons. Nat Methods 4 (11), 931-936 (2007)). The gap between the two hybridized, locus-specific arms of a padlock probe is polymerized and ligated to form a circular strand of DNA. These circles can then be amplified using the common “backbone” sequence that connects the two arms. This makes padlock probes highly multiplexable, with tens of thousands of probes used within a single reaction. The resulting libraries can then be sequenced with a massively parallel sequencing system. Padlock probes have been successfully used to specifically amplify 10,000 human exons (Porreca et al., supra), and an over 10,000-fold improvement in capturing efficiency has been made.

To apply padlock probes to profiling DNA methylation, a probe set wasdesigned to target 10,000 locations in a bisulfite-treated human genome(Example IV and Table 1). Bisulfite treatment converts all unmethylatedcytosines to uracil, which is recognized as a thymine (Clark, S. J.,Harrison, J., Paul, C. L., & Frommer, M., High sensitivity mapping ofmethylated cytosines. Nucleic Acids Res 22 (15), 2990-2997 (1994)).These probes targeted the ENCODE regions, which represent ˜1% of thehuman genome and for which expression and chromatin immunoprecipitation(ChIP) data are available (Birney, E. et al., Identification andanalysis of functional elements in 1% of the human genome by the ENCODEpilot project. Nature 447 (7146), 799-816 (2007)). Rather than targetingpromoter regions or CpG islands, these probes were distributed randomlyover all ENCODE regions (Table 2).

TABLE 1 BSPP list (separate file) and description of columns. Table 1 contains the sequences of the bisulfite padlock probes designed for one experiment, along with additional information and experimental data. “Estimated methylation” values only existed if there were 10 or more observations of a site, otherwise “NA” was given.

Column   Description
1        ENCODE target region ID
2        bisulfite padlock probe sequence
3        chromosome target
4        start position of 10 bp targeted span
5        end position of 10 bp targeted span
6        strand
7        position(s) of targeted CpG cytosines (comma separated if more than one)
8        GM06990 technical replicate 1: number of observations
9        GM06990 technical replicate 1: estimated methylation
10       GM06990 technical replicate 2: number of observations
11       GM06990 technical replicate 2: estimated methylation
12       PGP1 lymphocyte: number of observations
13       PGP1 lymphocyte: estimated methylation
14       PGP9 lymphocyte: number of observations
15       PGP9 lymphocyte: estimated methylation
16       PGP1 fibroblast: number of observations
17       PGP1 fibroblast: estimated methylation
18       PGP9 fibroblast: number of observations
19       PGP9 fibroblast: estimated methylation
20       PGP1 induced pluripotent cells: number of observations
21       PGP1 induced pluripotent cells: estimated methylation
22       PGP9 induced pluripotent cells clone 1: number of observations
23       PGP9 induced pluripotent cells clone 1: estimated methylation
24       PGP9 induced pluripotent cells clone 2: number of observations
25       PGP9 induced pluripotent cells clone 2: estimated methylation

TABLE 2 Distribution statistics for BSPP probes and MSCC sites. This table gives some statistics on the sites profiled by BSPP and MSCC methods, and compares these to hypothetical profiles of “all CpG sites” and “all genomic sequence.”

                             BSPP (ENCODE set)            MSCC (unique HpaII)   all CpG sites   genomic sequence
number                       9,552 probes (10,704 CpGs)   1,417,432             28,485,346      NA
within CpG islands           1.2%                         13.5%                 7.5%            0.7%
within 1 kb of TSS           5.7%                         3.4%                  2.3%            1.3%
inside genes                 54.6%                        47.8%                 43.3%           34.3%
within repetitive sequence   0% (by design)               33.5%                 51.5%           48.8%

All data was produced using the March 2006 human reference sequence (NCBI Build 36.1), downloaded from University of California at Santa Cruz (UCSC). CpG islands were based on UCSC's CpG island annotation. Transcription start sites (TSS) and gene locations were calculated using UCSC's RefGene list. Repetitive sequence was based on the letter casing in the genome sequence, produced by UCSC.

Without intending to be bound by scientific theory, given that no effort was made to target gene regions in the design, it initially seemed unlikely that as many as 54.6% of BSPP probes would fall within genes. However, this is consistent with the fact that approximately 60% of ENCODE regions are in gene transcript regions by calculations that were performed (based on RefGene annotations). To simplify design, the BSPP probes avoided targeting sites with CpGs in the hybridizing arms: approximately 60% of all CpG sites and approximately 98% of CpGs within CpG islands were excluded by this criterion alone from potentially being assayed.

An initial experiment used the BSPP set to investigate cytosinemethylation in the GM06990 EBV-transformed B-lymphocyte cell line, acell line also used in the ENCODE project (Birney, E. et al., supra)(Example IV). The expected size band was observed and isolated from thegel (FIG. 5A) and, to check the specificity of the capturing, 75individual library molecules were cloned and sequenced. All were uniqueand mapped to the desired target regions, illustrating the highspecificity padlock probe technology could achieve despite the reducedgenomic complexity after bisulfite conversion. Technical replicates ofcapturing were performed followed by Illumina Genome Analyzer (formerlySolexa) sequencing to check the reproducibility of the method and foundthat both the numbers of probe observations (FIGS. 5B and 5C) and theinferred methylation levels (FIG. 1B) were highly correlated. To ruleout the possibility of systemic bias, traditional Sanger sequencing wasperformed on 33 regions amplified from bisulfite treated DNA. Themethylation levels determined by this method were highly correlated withthe BSPP-determined methylation (FIG. 1C and Example IV). Methylationlevels were bimodally distributed with most sites <20% or >80%methylated (FIG. 6A), which is consistent with previous reports(Eckhardt, F. et al., DNA methylation profiling of human chromosomes 6,20 and 22. Nat Genet 38 (12), 1378-1385 (2006)).

Because the sequencing was clonal, BSPP data could be used to investigate strand-specific information (e.g., correlation with neighboring CpG sites or SNPs). It was determined that, within probes spanning more than one CpG, there was a positive correlation between the methylation states of those CpGs on individual strands (FIG. 7).

Example II Methyl Sensitive Cut Counting (MSCC)

To explore the relationship between methylation and gene expression levels in the promoter region and elsewhere in the gene, ENCODE project gene expression data for this cell line was used to split genes into two equal groups: “highly expressed” and “lowly expressed” genes. For each group, median cytosine methylation was plotted against gene position (FIG. 2A). In the highly expressed genes, a pattern of low methylation was observed in the promoter region and high methylation was observed in the rest of the gene body. The lowly expressed genes had moderate methylation in both promoter and gene-body regions.

Without intending to be bound by scientific theory, cytosine methylation is an epigenetic feature that may interact with other epigenetic features such as histone modifications. To look for correlations between DNA methylation and histone modification, available ChIP data (Birney, E. et al., Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447 (7146), 799-816 (2007)) was compared with the methylation data presented herein. Cytosine methylation was observed to be correlated with H3K36 methylation and anti-correlated with H3K27 methylation (FIG. 8). Without intending to be bound by scientific theory, these correlations likely reflect the distribution of the probes, half of which fell within gene bodies (only 5% were within 1 kb of transcription start sites). The correlations were consistent with the gene-body pattern of the histone modifications: H3K36 methylation is higher in the gene-body of highly expressed genes, while H3K27 methylation is high in the gene-body of lowly expressed genes (Barski, A. et al., High-resolution profiling of histone methylations in the human genome. Cell 129 (4), 823-837 (2007)).

The methylation profiling methods will, in part, be used to deeplyexplore the relationship between genotype and phenotype throughcollection of multi-faceted biological information for individualsregistered within the personal genome project (Church, G. M., Thepersonal genome project. Mol Syst Biol 1, 2005 0030 (2005)). To explorehow methylation patterns vary between different cell types and differentindividuals, the ENCODE BSPP set was applied to several cell lines fromthe PGP: PGP1 EBV-transformed B-lymphocytes, PGP9 EBV-transformedB-lymphocytes, PGP9 fibroblasts, and induced pluripotent stem cells(iPS) derived from PGP9 fibroblasts (Section IV). Consistent withprevious studies (Eckhardt, F. et al., DNA methylation profiling ofhuman chromosomes 6, 20 and 22. Nat Genet 38 (12), 1378-1385 (2006)),the methylation patterns of lymphoblast lines derived from differentindividuals were highly correlated (FIG. 9A), while the correlationbetween fibroblast and lymphoblast cells from the same individual wasmuch lower (FIG. 9B). The PGP9 iPS cells were hypermethylated in theENCODE regions of ˜400 genes, compared to the fibroblast line they werederived from (FIG. 9C). Further investigation is needed as cellculturing can affect global methylation levels (Meissner, A. et al.,Genome-scale DNA methylation maps of pluripotent and differentiatedcells. Nature 454 (7205), 766-770 (2008)). Using gene expression datadescribed herein, it was observed that the phenomenon of gene-bodymethylation was repeated in the PGP cell lines (Example IV and FIG. 10).

Methyl sensitive cut counting (MSCC) is a whole genome methylation profiling method. MSCC queries the sensitivity of all CCGG sites within the genome to HpaII, a methylation sensitive restriction enzyme that cuts at CCGG sequences. Methylation sensitive restriction enzymes are a common tool for studying methylation: these enzymes typically have a recognition site that contains a CpG dinucleotide and are blocked from cutting if that site is methylated (Bird, A. P. & Southern, E. M., Use of restriction enzymes to study eukaryotic DNA methylation: I. The methylation pattern in ribosomal DNA from Xenopus laevis. J Mol Biol 118 (1), 27-47 (1978)). With MSCC, no choice is made for which sites are targeted: all uniquely identifiable HpaII sites are profiled. By generating a library of tag fragments from all cut locations and then using massively parallel sequencing to gather millions of observations of these, one of skill in the art can infer the methylation level by the number of tags observed (FIG. 3A). Sites with many reads were inferred to have low methylation levels, and sites with no reads were inferred to have high methylation levels. A control library was also constructed by replacing HpaII with a methylation-insensitive isoschizomer, MspI. However, the additional cost incurred does not seem warranted, as the data indicated that the HpaII library alone was highly correlated with methylation at individual sites (see below).

The human genome contains 2.3 million HpaII sites and each of these, if cut, can generate two possible library tags. Of the 4.6 million possible sequences, about half (2.3 million) are considered “unique,” i.e., they have more than one base difference when compared to any other possible sequence. Of the 2.3 million sites, 888,455 produce two unique tags and 528,977 produce a single unique tag. These combine to a total of 1,417,432 genomic locations that are profiled with this method. Nearly half of these sites occur within genes (>18,000 genes have at least one site within them) and 13.5% are within CpG islands (90% of CpG islands have at least one site within them) (Table 2).
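The site and tag counts above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only; the variable names are hypothetical, and the inputs are the two site counts stated in the text.

```python
# Counts taken from the text above.
sites_with_two_unique_tags = 888_455
sites_with_one_unique_tag = 528_977

# Every site with at least one unique tag is a profiled genomic location.
profiled_sites = sites_with_two_unique_tags + sites_with_one_unique_tag

# Total unique tags contributed by those sites.
unique_tags = 2 * sites_with_two_unique_tags + sites_with_one_unique_tag

print(f"profiled sites: {profiled_sites:,}")  # 1,417,432
print(f"unique tags:    {unique_tags:,}")     # 2,305,887 (about 2.3 million)
```

The sum reproduces the 1,417,432 profiled locations, and the tag total agrees with the statement that about 2.3 million of the 4.6 million possible tag sequences are unique.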

An MSCC HpaII library and an MspI control library were produced for thePGP1 EBV-transformed B-lymphocyte cell line, for which BSPP and geneexpression data had been obtained. Libraries were sequenced using anIllumina Genome Analyzer and matched to a list of all possible tagsequences (Table 5). Two technical replicates were made of the HpaIIlibrary that, although subject to variance according to the Poissondistribution, showed a high correlation in the number of observationsfor each site (R=0.82, FIG. 11). The availability of BSPP data for thesame sample enabled comparison of the methylation levels determined byBSPPs with MSCC HpaII data for 381 sites (726 individual tags) (FIG.15). When data were binned according to the BSPP-determined methylationlevels, the average number of counts for each bin was linearly relatedto its methylation level (FIG. 3B). This was used to estimate averagemethylation levels when counts for multiple sites are averaged. BSPPmethylation data could also be used to estimate methylation levels forindividual sites based on MSCC HpaII counts (FIG. 12).

MSCC counts had more noise for sites containing more than one HpaIIrecognition site.

As a result, MSCC was more accurate at distinguishing moderatelymethylated sites from highly methylated sites than it was fordistinguishing moderately from weakly methylated sites, although deepersequencing coverage should improve accuracy (FIG. 17). In addition,preliminary data indicated that the accuracy could be improved bysequencing an ‘inverse library’ of methylated CCGG sites, which wasconstructed by dephosphorylating HpaII-digested fragment ends, digestingwith MspI and then ligating an MmeI-containing adaptor to generatesequencing tags (FIG. 16). In the following analyses, however, only theMSCC HpaII data generated from three lanes of Illumina sequencing wasutilized.

Compared to BSPP, which targeted several thousand data points coveringapproximately 400 genes, the MSCC technology covered the entire genome,allowing examination of the relationship between gene expression andcytosine methylation more thoroughly. Genes were split into five equalgroups based on their expression levels and the running average of MSCCobservations vs. gene position was plotted for each (FIGS. 2B-2D). Asimilar pattern of low promoter methylation and high gene bodymethylation was observed in high expression genes.

To investigate the amount of information contained within CpG islands,the data were separated into two groups according to whether a givensite was located within or outside a CpG island (FIGS. 2E and 2F).Although methylation of CpG islands is known to suppress transcription,it was determined that, on average, sites outside CpG islands wereresponsible for most of the difference in methylation levels observedbetween the promoters of high and low expression genes.

To explore how methylation information was correlated with geneexpression on the level of individual genes, gene promoter methylationand gene body methylation of individual genes were compared. Accordingto these two metrics, genes formed two clusters that corresponded tohigh and low expression levels (FIG. 4A). Gene body methylation appearedto be bimodally distributed, with two peaks corresponding to highlyexpressed and lowly expressed genes (FIG. 4B).

The rapid development of cheaper, massively parallel sequencingtechnologies (Khulan, B. et al., Comparative isoschizomer profiling ofcytosine methylation: the HELP assay. Genome Res 16 (8), 1046-1055(2006)) is opening the way for new strategies for studying biologicalprocesses (Schuster, S. C., Next-generation sequencing transformstoday's biology. Nat Methods 5 (1), 16-18 (2008); Mardis, E. R.,Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9,387-402 (2008); Kahvejian, A., Quackenbush, J., & Thompson, J. F., Whatwould you do if you could sequence everything? Nat Biotechnol 26 (10),1125-1133 (2008)), including epigenetic features like DNA methylation(Meissner, A. et al., Genome-scale DNA methylation maps of pluripotentand differentiated cells. Nature 454 (7205), 766-770 (2008); Cokus, S.J. et al., Shotgun bisulphite sequencing of the Arabidopsis genomereveals DNA methylation patterning. Nature 452 (7184), 215-219 (2008);Lister, R. et al., Highly integrated single-base resolution maps of theepigenome in Arabidopsis. Cell 133 (3), 523-536 (2008)). BSPP and MSCCare two complementary methods that take advantage of the cheap, accurateand quantitative nature of new sequencing technologies to profilecytosine methylation at single-base resolution in targeted andgenome-wide surveys.

Example III Discussion

The data presented herein from both BSPP and MSCC methods shows a pattern of gene body methylation in the highly expressed genes of human cell lines. This is a phenomenon that has already been observed in Arabidopsis (Cokus, S. J. et al., Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452 (7184), 215-219 (2008); Lister, R. et al., Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133 (3), 523-536 (2008); Zhang, X. et al., Genome-wide high-resolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 126 (6), 1189-1201 (2006); Zilberman, D., Gehring, M., Tran, R. K., Ballinger, T., & Henikoff, S., Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet 39 (1), 61-69 (2007)), where it is associated with active genes. There is growing evidence in mammals: gene body methylation has been observed in the active human X chromosome when compared to the inactive X (Hellman, A. & Chess, A., Gene body-specific methylation on the active X chromosome. Science 315 (5815), 1141-1143 (2007)), and low methylation sites in the gene body have been associated with low expression genes in cancer cell lines (Shann, Y. J. et al., Genome-wide mapping and characterization of hypomethylated sites in human tissues and breast cancer cell lines. Genome Res 18 (5), 791-801 (2008)). Gene body methylation has been hypothesized to suppress spurious initiation of transcription within active genes in Arabidopsis (Zhang, X. et al., Genome-wide high-resolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 126 (6), 1189-1201 (2006); Zilberman, D., Gehring, M., Tran, R. K., Ballinger, T., & Henikoff, S., Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet 39 (1), 61-69 (2007)) and a similar function may exist in mammals (Suzuki, M. M. & Bird, A., DNA methylation landscapes: provocative insights from epigenomics. Nat Rev Genet 9 (6), 465-476 (2008)).

In addition, it was determined that expression-related differences in promoter regions were much larger, on average, for CpGs outside islands. CpG islands and promoters have been the preferred target of many studies and have, in the past, guided the design of many methylation profiling experiments (Meissner, A. et al., Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454 (7205), 766-770 (2008); Illingworth, R. et al., A novel CpG island set identifies tissue-specific methylation at developmental gene loci. PLoS Biol 6 (1), e22 (2008); Weber, M. et al., Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet 39 (4), 457-466 (2007); Bibikova, M. et al., High-throughput DNA methylation profiling using universal bead arrays. Genome Res 16 (3), 383-393 (2006)). In light of the observations presented herein, less biased profiling methods are powerful in that they enable one of skill in the art to discover aspects of methylation that might otherwise have been missed. As DNA sequencing costs drop, tools like BSPP and MSCC can be readily applied to study the epigenomic changes associated with developmental stages, environmental changes, disease states and the like.

Example IV

Materials and Methods

Using approximately 30,000 padlock probes generated from Agilent's oligonucleotides (FIG. 9A), three critical steps of technical development in the genotyping phase were tested (FIG. 9B). Although the protocol for allele-specific extension and circularization (Step 8) had been well established (Hardenbol et al. (2003) Nat. Biotechnol. 21:673; Hardenbol et al. (2005) Genome Res. 15:269), two critical points were identified: 1) using Apyrase to remove contaminating nucleotides; and 2) adding polymerase and ligase after probes were annealed to the genomic templates to ensure specific extension and ligation.

It was also determined that, due to the low ligation efficiency on genomic templates, amplification of circularized padlock probes by PCR (Step 10) was associated with high amplification biases. However, it was also determined that a pre-PCR Rolling Circle Amplification using either Bst polymerase or phi29 polymerase reduced the biases dramatically. The genotyping assay was verified using Sanger Sequencing, and it was confirmed that the genotyping assay is specific. In addition, it was determined that, in designing padlock probes, SNPs located within repetitive regions of the human genome should not be included, because the corresponding padlock probes tended to be present at very high copy numbers after circularization and reduced the efficiency of the genotyping assay.

Cell Lines, Genomic DNA and RNA

Genomic DNA of GM06990 (a HapMap/ENCODE sample) was obtained from Coriell Cell Repository (Worldwide Website: ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM06990).

With approval from Harvard Medical School's Institutional Review Boards (IRB), blood and skin biopsies were obtained from donors of the Personal Genome Project (PGP, Worldwide Website: personalgenomes.org). The EBV-transformed B-lymphocyte cell lines and the derivative genomic DNA for donors PGP1 (GM20431, Worldwide Website: ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM20431) and PGP9 (GM21833, Worldwide Website: ccr.coriell.org/Sections/Search/Sample_Detail.aspx?Ref=GM21833) were generated and acquired from Coriell Cell Repository. Genomic DNA obtained directly from Coriell was used for methylation analysis of these lines; cultured cell lines were used for gene expression profiling. The primary fibroblast line for PGP9 was generated by and obtained from the Brigham and Women's Hospital. The cultured cell line was used for both genomic DNA and gene expression profiling.

The iPS cell lines were derived by infecting primary human fibroblastsof PGP9 with highly concentrated retroviral OCT3, KLF4, SOX2 and c-MYCparticles (Park, I. H. et al., Generation of human-induced pluripotentstem cells. Nat Protoc 3 (7), 1180 (2008)). The infected cells weretrypsinized onto a feeder-layer after 4 days and maintained in hES media(KO-DMEM (Invitrogen), 20% KO-SR (Invitrogen), L-glutamine,non-essential amino acid, pen/strep, 55 μM 2-mercaptoethanol and 10ng/ml bFGF). The iPS colonies were identified by their characteristicmorphology after 3-4 weeks.

Immortalized lymphocytes were cultured in RPMI-1640 medium (Invitrogen)with 10% FBS (Invitrogen) and L-Glutamine (Invitrogen). Primaryfibroblasts were cultured in DMEM/F12 medium with 15% FBS and 10 ng/μlEGF. Human iPS cell lines were grown on a feeder layer of mouseembryonic fibroblasts (Global Stem) in hES media, and mechanicallyseparated from mouse cells prior to DNA/RNA extraction.

Genomic DNAs and total RNAs were extracted with AllPrep DNA/RNA/Protein Mini Kit (Qiagen).

RNA gene expression profiling for the PGP cell lines was done using Illumina's bead array technology (Worldwide Website: www.illumina.com/pages.ilmn?ID=5) through the service by Harvard Partner Center for Genetics and Genomics.

Bisulfite Padlock Probe (BSPP) Design

Each padlock probe in a given set consists of two unique “arm” sequences that anchor a specific locus of interest, the extension arm and the ligation arm, and a common center “backbone” sequence connecting the two arms. The extension arm was the one where polymerization started, and the ligation arm was the one where polymerization ended and ligation occurred.

In the BSPP set, arms were up to 28 base pairs each in length, which was set to achieve a Tm range for specific capturing (see below). Probes were designed to target bisulfite-treated human DNA sequences chosen from the ENCODE regions. Potential probe locations were chosen according to the following criteria: first, a CpG dinucleotide existed at the 3′ end of the 10 base pair target, at the junction between the gap and ligation arm; second, CpG-free regions (for arm design) existed upstream and downstream of the 10 base pair span. The first criterion was chosen for two reasons: (1) the correct incorporation of CpG (or TpG if cytosine is unmethylated) depends on the high fidelity of both polymerase and ligase, and (2) the queried cytosine is positioned as close as possible to the sequencing primer in the final library molecules. Because the common backbone sequence was used as a sequencing primer, 28 bases of the arm sequence were sequenced before informative bases from the span were reached. Thus, it was important that the targeted CpG be as close as possible to the sequencing primer in order to fall within the sequenced region and reduce low quality reads. Notably, because of the second criterion (no CpGs within the arms), the candidate sites were actually biased against falling within CpG islands.

ENCODE region sequences were downloaded from UCSC in the form of Fasta files (Worldwide Website: genome.ucsc.edu/ENCODE/). All CpGs meeting these criteria were considered, and both strand possibilities were considered. Because there was concern about how well padlock probes would work with bisulfite-treated DNA, no effort was made to target particular locations inside the ENCODE region. Instead, probes that seemed most optimal according to various criteria were chosen.

Each possible probe site was then subject to the following requirements: (A) arm length was chosen to fit a narrow melting temperature range of 50-55° C. (Perl code to calculate melting temperature is available upon request), (B) a minimum “G” content of 20% was required of both arms, and (C) at least three non-CpG cytosines were required within the 15 bases of the “ligation” arm (to prevent probes targeting unconverted DNA).
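Requirements (A) through (C) can be sketched as a simple filter. The following Python is a hypothetical illustration, not the Perl code referred to above. The function names, the choice to pass melting temperatures in as precomputed values, and the assumption that criterion (C) is evaluated on the 15 bases of unconverted genomic sequence beneath the ligation arm are all assumptions made for this sketch.

```python
def g_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G."""
    seq = seq.upper()
    return seq.count("G") / len(seq)


def non_cpg_cytosines(seq: str) -> int:
    """Count cytosines that are not followed by G.

    A C at the very end of seq is counted, because the next base is not
    known from seq alone (an assumption of this sketch).
    """
    seq = seq.upper()
    return sum(
        1 for i, base in enumerate(seq)
        if base == "C" and seq[i + 1:i + 2] != "G"
    )


def passes_filters(ext_arm: str, lig_arm: str, lig_target_15mer: str,
                   tm_ext: float, tm_lig: float) -> bool:
    """Apply requirements (A)-(C) to one candidate probe site."""
    tm_ok = all(50.0 <= tm <= 55.0 for tm in (tm_ext, tm_lig))       # (A)
    g_ok = all(g_fraction(arm) >= 0.20 for arm in (ext_arm, lig_arm))  # (B)
    c_ok = non_cpg_cytosines(lig_target_15mer) >= 3                  # (C)
    return tm_ok and g_ok and c_ok
```

Passing the melting temperatures in as arguments keeps the sketch independent of whichever T_(m) calculation is actually used.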

Because bisulfite treated DNA loses a lot of sequence complexity asalmost all cytosines become uracil and base-pair as thymines, there wasconcern about sequence uniqueness. This was addressed with steps D-F:(D) potential sites were selected based on average uniqueness for eacharm, defined as the average frequency of all 15 base pair segmentswithin it, (E) sites were selected based on an “internal uniqueness”score, the product of frequencies of the internal 15-mers on either sideof the 10 base pair span, (F) potential mis-hybridization targets forremaining probes were removed by searching with BLAST against a custom“bisulfite treated” human genome (all non-CpG C's converted to T).Finally (step G), the set was trimmed to remove probes that overlappedin location to avoid hybridization conflicts.
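The custom “bisulfite treated” genome used in step F (all non-CpG C's converted to T) can be illustrated with a short in silico conversion. This Python sketch is an assumption-laden illustration: the function name and the keep_cpg option are hypothetical, only the given strand is converted, and a C at the very end of the input is treated as non-CpG because the following base is unknown.

```python
import re


def bisulfite_convert(seq: str, keep_cpg: bool = True) -> str:
    """Return an in silico bisulfite-converted copy of seq.

    keep_cpg=True converts only non-CpG cytosines to T, as described for
    the custom genome in step F; keep_cpg=False converts every C to T
    (a fully unmethylated template).
    """
    seq = seq.upper()
    if keep_cpg:
        # Replace each C that is not immediately followed by G.
        return re.sub(r"C(?!G)", "T", seq)
    return seq.replace("C", "T")
```

For example, bisulfite_convert("ACGTCCA") gives "ACGTTTA", while bisulfite_convert("ACGTCCA", keep_cpg=False) gives "ATGTTTA". To model the opposite strand, the reverse complement would need to be converted separately.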

Starting with 152,658 possible sites, each step discarded the following percentage of probes:

Step                  Discarded   Remaining
(A: Tm range)         70%         45,282
(B: G content)        12%         39,675
(C: non CpG C's)      40%         23,020
(D & E: uniqueness)   24%         17,468
(F: BLAST)            36%         11,166
(G: no overlap)       14%         9,552

Because some probes contained more than one CpG within the 10 base pair span, this final set of 9,552 probes targeted 10,704 CpGs. It is noteworthy that the criteria chosen in this experiment targeting ENCODE regions were very stringent; accordingly, these data do not necessarily suggest that only a tiny fraction of CpG sites are targetable.

Bisulfite Padlock Probe Synthesis and Processing

To make the padlock probes amplifiable, 150-mer oligos having the following structure were designed. The common sequences (underlined) at both ends were used to amplify the pool of synthesized oligos.

ACGGGTGGAAGATGGATGAT[ligation_28 nt]AGATCGGAAGAGCGTCGTGTAGGGAAAGCTGAGCAAATGTTATCGAGGTC[extension_28nt]GATCGTCCTTACACACTAGCCGTC

Using a programmable microarray (Agilent Technologies), a total of 9,552 such oligos (150-mers) were synthesized, cleaved off, and collected in a single tube. The yield was estimated to be about 0.18 fmol/species.

To amplify the oligos, real time PCR was performed on 1% of the synthesized oligos (1.8 amol/species, or ˜1 million molecules/species) to monitor the amplicon in a 100 μl reaction assembled with Platinum Taq supermix, 50 pmol each of primers (AP1_BS10.SS.U: A*C*G*GGTGGAAGATGGATGAU, where * denotes a phosphorothioate modification and U denotes deoxyuridine; and AP2_BS10.p: /5phos/GACGGCTAGTGTGTAAGGAC), and 0.5× SYBR Green. The PCR program was 95° C. for 5 minutes, 15 cycles of 95° C. for 30 seconds, 58° C. for 1 minute, and 72° C. for 1 minute, and finally 72° C. for 5 minutes. The PCR product was purified with a Qiagen PCR purification kit and subsequently quantified. Using a 96-well plate, a 9.6 ml PCR reaction was set up with 25 fmol of template along with Platinum Taq supermix, 4.8 nmol each of primers, and 0.5× SYBR Green. The same PCR program was used. The PCR products were purified by phenol:chloroform extraction followed by a Qiagen PCR purification kit, and a total of 37 μg of DNA was obtained.

The purified PCR product was split into eight reactions with 10 units of lambda exonuclease in 1× lambda exonuclease reaction buffer, and incubated at 37° C. for 45 minutes, then 75° C. for 15 minutes. After purification with QiaQuick columns, the ssDNA was quantified with a Nanodrop to be 33 ng/μl in 200 μl total. This was split into four tubes, each of which was assembled with 50 μl of ssDNA (33 ng/μl), 6 μl of 10× DpnII reaction buffer, and 2 μl of 100 μM Guide DpnII BS10 (GGCTAGTGTGTAAGGACGATCNN). The “guide” oligo was annealed to the ssDNA by bringing the reaction to 95° C. for 5 minutes, followed by a ramp to 60° C. at 0.1° C./sec, then 60° C. for 10 minutes, and 37° C. for 1 minute. In each tube, 5 μl of DpnII (10 u/μl) (NEB) and 5 μl of USER enzyme (1 u/μl) (NEB) were added, and the reaction was incubated at 37° C. for 3 hours. The final product was loaded into multiple lanes of 6% TBE urea precast gels (Invitrogen), and the desired band was cut and subsequently purified. Finally, the concentration of padlock probes was quantified on a 6% TBE urea gel along with a quantitative low mass DNA ladder (Invitrogen). The probes were at 9 ng/μl, which is 257 nM (27 pM for each of 9,552 species).

Bisulfite Treatment of Genomic DNA

Bisulfite treatment was performed using the EZ DNA Methylation-Gold Kit (Zymo Research). The genomic DNA (2-10 μg) was split into multiple tubes with 500 ng each, and converted with sodium bisulfite according to the manufacturer's protocol. The typical yield was 50-75% after bisulfite conversion. The final product was eluted with dH₂O and concentrated to >100 ng/μl.

CpG Padlock Capturing and Construction of Sequencing Libraries

A 15 μl reaction was set up using 1× Ampligase buffer (Epicentre), 1 μg (˜0.5 amol of haploid genome) of genomic DNA, and 33.5 ng (˜1 pmol) of probes. Using a thermal cycler, the reaction was denatured at 95° C. for 10 minutes, ramped to 64° C. for 5 hours, then 62° C. for 5 hours, and finally hybridized at 60° C. for 24 hours. At 60° C., the gap filling and sealing mix was added (0.5 pmol of dNTPs (USB), 2 units of Taq Stoffel fragment (Applied Biosystems), and 2.5 units of Ampligase (Epicentre) in Ampligase storage buffer (Epicentre)) totaling 2 μl, and the reaction was incubated at 60° C. for 2 hours. The reaction was then cycled 5 times through 95° C. for 2 minutes and 60° C. for 5 hours. To digest the linear DNA, the incubation temperature was lowered to 37° C., and 2 μl of Exonuclease I (20 units/μl) (USB) and 2 μl of Exonuclease III (200 units/μl) (USB) were immediately added. The reaction was incubated at 37° C. for 2 hours followed by 94° C. for 5 minutes.

To amplify the circularized padlock probes, for each sample two 100 μl reactions were set up, each of which was assembled with 50 μl of 2× iQ SYBR Green supermix (Bio-Rad), 10 μl of template from above, 40 pmol each of primers (CIR_for_SLXA: CAAGCAGAAGACGGCATACGAGCTCTGAGCAAATGTTATCGAGGTC and CIR_rev_SLXA: AATGATACGGCGACCACCGACACTCTTTCCCTACACGACGCTC), and water. The reactions were carried out on a real time PCR instrument to avoid over-amplification by monitoring the amplicons. The PCR program was 96° C. for 3 minutes, 5 cycles of 96° C. for 15 seconds, 60° C. for 30 seconds, and 72° C. for 30 seconds, then 13 cycles of 96° C. for 15 seconds and 72° C. for 1 minute, and finally 72° C. for 5 minutes. A small fraction (˜2%) of the PCR product was run on a TBE polyacrylamide gel to check whether the desired bands were present (e.g., FIG. 5A), and if so, the rest was loaded and the desired band was cut from the gel. To purify the DNA from the polyacrylamide gel, 3× volume of TE (pH 8.0) was added, followed by incubation at 60° C. for at least 30 minutes. A 0.2 μm Nanosep spin column (Pall) was used to remove gel fragments, ethanol precipitation was used to recover the DNA from the solution, and the DNA was finally resuspended in 30 μl dH₂O.

BSPP Sanger Sequencing Validation

To validate the accuracy of the methylation level determined by the padlock capturing and Illumina Genome Analyzer sequencing, primers for 33 targeted sites were designed (Table 3), and PCR amplification followed by conventional Sanger sequencing was performed. For each 10% interval of methylation level (0%, 10%, 20%, . . . 90%, and 100%), three sites were randomly chosen (thus 33 sites total). For each site, real time PCR was performed in duplicate 40 μl reactions with 50 ng of bisulfite converted GM06990 genomic DNA, 1× iQ SYBR Green supermix (Bio-Rad), and 500 nM each of forward and reverse primers. The PCR program was 96° C. for 3 minutes, 40 cycles of 95° C. for 30 seconds, 62° C. for 1 minute, and 72° C. for 1 minute, and finally 72° C. for 5 minutes.

TABLE 3
PCR primers used to validate methylation level determined by BSPP. This is a table of the targets and primers that were used for Sanger sequencing validation of the BSPP data.

Primer name | Location of queried cytosine (chr_position) | Methylation level measured by BSPP | Primer sequence
0.0_F_chr5_141931200 | chr5_141931200 | 0% | TTAAAGGATTTTAGGAATTTTATTAGTT
0.0_R_chr5_141931200 | chr5_141931200 | 0% | AAATACTATCAAAAACTACTTCCAAAC
0.0_F_chr11_2278418 | chr11_2278418 | 0% | GTTGTGGTTAGATTTGGTTTTT
0.0_R_chr11_2278418 | chr11_2278418 | 0% | ACCTTAACCTCCCTAAAACTAATAA
0.0_F_chr21_32895366 | chr21_32895366 | 0% | AAGTTTTTTTAGTAAGGTTGGGA
0.0_R_chr21_32895366 | chr21_32895366 | 0% | CACTACACTCTATCCTAAACAACAA
0.1_F_chr5_131430745 | chr5_131430745 | 10% | ATTTTTTGGTTTTAGGTTTATAGTG
0.1_R_chr5_131430745 | chr5_131430745 | 10% | AAATCTCTCTCAAAAATTCCTTAA
0.1_F_chr11_4861886 | chr11_4861886 | 10% | TTAATTTGGTTTGTTGATTTTAGTT
0.1_R_chr11_4861886 | chr11_4861886 | 10% | CTCACCTAAAAAATATATAAATCCC
0.1_F_chr22_31384238 | chr22_31384238 | 10% | GTGAATAGGTTAAGTGAGGTAGAAG
0.1_R_chr22_31384238 | chr22_31384238 | 10% | AAAAAAATCAAACACCAACTATAAA
0.2_F_chr11_2136827 | chr11_2136827 | 20% | GGGTGAGTAGTAGGTTTGTAGTAAA
0.2_R_chr11_2136827 | chr11_2136827 | 20% | CAAATAACACCATAAACTAAAACAA
0.2_F_chr14_98615277 | chr14_98615277 | 20% | TTTGTTTTAAGTTTTTAAAGGGTAA
0.2_R_chr14_98615277 | chr14_98615277 | 20% | AAATACTCTAAATTTCTCACAACCTAC
0.2_F_chr19_59719262 | chr19_59719262 | 20% | GTAGGTTTTAGGAATTTTAGGATAGA
0.2_R_chr19_59719262 | chr19_59719262 | 20% | TAAAACCCTTTACATTTCAATAAAT
0.3_F_chr2_220372916 | chr2_220372916 | 30% | TTTTATTTAGAGTTGTTTTATGTTAAGG
0.3_R_chr2_220372916 | chr2_220372916 | 30% | ATCTCCTATAAATCCCCAATTAATA
0.3_F_chr5_131431205 | chr5_131431205 | 30% | GTTTTGGTAGAGATTTGTTTGG
0.3_R_chr5_131431205 | chr5_131431205 | 30% | AAAAAAAACCCCTACTCTACTACTC
0.3_F_chr11_64232560 | chr11_64232560 | 30% | AGGTGATATGAGGAAGTATTGTTAT
0.3_R_chr11_64232560 | chr11_64232560 | 30% | AAACCTCCATACTAAAAAATTTACAT
0.4_F_chr5_141987439 | chr5_141987439 | 40% | TTAGATTTTATTTTGGATTTTGAAA
0.4_R_chr5_141987439 | chr5_141987439 | 40% | CTCTACAAAAACTTAACCCTTAAAA
0.4_F_chr16_25811080 | chr16_25811080 | 40% | GAAAATTTGATTTTAAAAGAATGTG
0.4_R_chr16_25811080 | chr16_25811080 | 40% | TTTTAAAAATAACAAAATCAACTCC
0.4_F_chr22_31195396 | chr22_31195396 | 40% | TTAATTGAAGATTAAATATTTTTGAGAT
0.4_R_chr22_31195396 | chr22_31195396 | 40% | CTTTAAAATTTCCTTTTAACCAAAT
0.5_F_chr7_27107421 | chr7_27107421 | 50% | GGAGTTTTTAAGGTTTTTATATTTTTT
0.5_R_chr7_27107421 | chr7_27107421 | 50% | CCAACACACAACTTCTAAAACTAA
0.5_F_chr11_1943248 | chr11_1943248 | 50% | TTAGGAGGTGTTTAGATGATTTTAG
0.5_R_chr11_1943248 | chr11_1943248 | 50% | CCCAATATATACACAACCAAAAC
0.5_F_chr11_130700083 | chr11_130700083 | 50% | ATGTTTGTGAAAGTAGGAGTTTATT
0.5_R_chr11_130700083 | chr11_130700083 | 50% | TACTCTTATCCCTTCTCCCTAATAT
0.6_F_chr5_131557355 | chr5_131557355 | 60% | GATTGTTAGTATTGTAGAGGGTTTG
0.6_R_chr5_131557355 | chr5_131557355 | 60% | AACTTCAATAATACATTAAAATAAAATTTT
0.6_F_chr16_25856798 | chr16_25856798 | 60% | GATTTTTAGTTTTGTAGTGTTGAGG
0.6_R_chr16_25856798 | chr16_25856798 | 60% | CTAATAAAATCTAAATTCAAAAACACTTAT
0.6_F_chrX_153233491 | chrX_153233491 | 60% | TTTGTGTTAGTTTTGGGTTTAATAT
0.6_R_chrX_153233491 | chrX_153233491 | 60% | CAACCTTCAATAAAAACAAACTATT
0.7_F_chr1_149600103 | chr1_149600103 | 70% | TAAGTTAGGTGTTGGGAGTTAATAG
0.7_R_chr1_149600103 | chr1_149600103 | 70% | TAAAATATCCACCTCAACTAAAATC
0.7_F_chr11_64054275 | chr11_64054275 | 70% | TGATTTTATTTTGAAAGTGAAGTTT
0.7_R_chr11_64054275 | chr11_64054275 | 70% | ATTTTCACAAAAACTATAAAACACAA
0.7_F_chrX_153373129 | chrX_153373129 | 70% | GATTTGTTTGTTTTTTTAAATTTTG
0.7_R_chrX_153373129 | chrX_153373129 | 70% | AAATTAATTCCAATTACACCAATAA
0.8_F_chr21_39672131 | chr21_39672131 | 80% | AAAATATTGGGATTATAGGTATGAGT
0.8_R_chr21_39672131 | chr21_39672131 | 80% | AACTTCTAAACTAACCAAAACAAAA
0.8_F_chr22_31794899 | chr22_31794899 | 80% | TGTTTTAGGAGGTGAATAAATTAAT
0.8_R_chr22_31794899 | chr22_31794899 | 80% | AACCTTATAAACTTCACAATCAAAC
0.8_F_chrX_152958511 | chrX_152958511 | 80% | TTTATTTAATATATGTTGGATGAATAATTA
0.8_R_chrX_152958511 | chrX_152958511 | 80% | CTAAAACCCTCCTCAATAACTTC
0.9_F_chr6_108430751 | chr6_108430751 | 90% | TGTTAATGAATATAATGTTTTGTTTTT
0.9_R_chr6_108430751 | chr6_108430751 | 90% | TAATACCCAACTAACTCCCTACTAA
0.9_F_chr8_119031762 | chr8_119031762 | 90% | TTATAGTTTGGGTGATAGAGTAAGATT
0.9_R_chr8_119031762 | chr8_119031762 | 90% | AAACCCTAAACAAAATACTCAATATAA
0.9_F_chr22_30304462 | chr22_30304462 | 90% | GGTAGATATGTTGTTGTGTGTAGAA
0.9_R_chr22_30304462 | chr22_30304462 | 90% | AAAAAAACTTCATAACCAAAACTC
1.0_F_chr2_118425897 | chr2_118425897 | 100% | TATGATAGAGGTGGTAGTAGAGGTG
1.0_R_chr2_118425897 | chr2_118425897 | 100% | TTCCAATTATCTCCTAAACAAAATA
1.0_F_chr6_74157666 | chr6_74157666 | 100% | AAAAGTTTAGTATATTTTGTGGTTTTT
1.0_R_chr6_74157666 | chr6_74157666 | 100% | CACCAATATATTATAAAAAAACTCTTTATT
1.0_F_chr11_1933957 | chr11_1933957 | 100% | GGGGTAGATATTAGGTTTTAAAGAG
1.0_R_chr11_1933957 | chr11_1933957 | 100% | AACTACAAAAACTCCTCAACAAA

Rather than isolating and sequencing many clones for each target, Sanger sequencing was performed on the raw PCR products (an average of three sequencing reactions per site) and methylation levels were inferred based on the sequence traces using the program PeakPicker (Ge, B. et al., Survey of allelic expression using EST mining. Genome Res 15 (11), 1584 (2005)) to measure the heights of a set of peaks within the trace. However, due to the lack of cytosines in the sequence, the cytosine in the queried CpG had an abnormal peak height, which made comparison between C and T peaks uninformative. Instead, the height of the “T” peak within the CpG location as well as the heights of four upstream and downstream “T” peaks were measured. The surrounding peaks provided normalization for measuring the relative height of the target “T” peak to a 100% value. The ratio of the target “T” (in the CpG) to the average of the surrounding “T” peaks was inferred to reflect the fraction of unmethylated cytosine at that position. This was similar to the principle applied in the commercially available software ESME (Lewin, J. et al., Quantitative DNA methylation analysis based on four-dye trace data from direct sequencing of PCR amplificates. Bioinformatics 20 (17), 3005 (2004)). Because sequencing reactions were performed multiple times and from both directions, these generated multiple estimates from which average and standard deviation values were obtained for the methylation of that site (FIG. 1C).
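
The peak-height normalization described above amounts to a simple ratio. The following Python sketch illustrates it under stated assumptions: the function names and the input layout (one target “T” height plus a list of flanking “T” heights per trace) are hypothetical, and no clamping to the 0-1 range is applied because none is described.

```python
from statistics import mean, stdev


def unmethylated_fraction(target_t_height, flanking_t_heights):
    """Ratio of the target "T" peak to the mean of the flanking "T" peaks."""
    return target_t_height / mean(flanking_t_heights)


def methylation_estimate(traces):
    """Combine several sequencing traces of one site.

    `traces` is a list of (target_t_height, flanking_t_heights) tuples.
    Returns (mean methylation, standard deviation); the standard deviation
    is reported as 0.0 when only one trace is available.
    """
    levels = [1.0 - unmethylated_fraction(t, flank) for t, flank in traces]
    sd = stdev(levels) if len(levels) > 1 else 0.0
    return mean(levels), sd
```

With two traces whose target peaks are 30% and 50% of the flanking average, the sketch returns a mean methylation of 0.6.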

BSPP Library Sequencing, Placement of Reads, and Determination of Methylation

Libraries were sequenced using a single lane of Illumina's Solexa sequencing system per sample (Table 4). The sequencing primer used was CACTCTTTCCCTACACGACGCTCTTCCGATCT. Reads were mapped by using BLAST with a custom database of the 9,552 target sequences (with an “N” at any CpG cytosine position). Any placements with mismatches in the 10 base pair “span” (bases polymerized rather than originally part of the probe) were discarded to reduce the chance of including data from probes mis-hybridized to other genomic locations. Accepted reads were then combined for each given probe to determine methylation at each position, based on the number of reads that had a C or T at a given position. Only probes with at least 10 reads in a sample were used to measure methylation level.
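
The per-position methylation call described above reduces to a ratio of read counts. A minimal Python sketch follows; the function names are hypothetical, and applying the 10-read threshold to the C and T reads at the queried position (rather than to all reads of the probe) is a simplifying assumption.

```python
from collections import Counter


def methylation_level(c_reads, t_reads, min_reads=10):
    """Fraction of reads showing C (methylated) at a CpG position.

    Returns None when fewer than `min_reads` informative reads exist.
    """
    total = c_reads + t_reads
    if total < min_reads:
        return None
    return c_reads / total


def methylation_from_calls(base_calls, min_reads=10):
    """Same calculation starting from the base observed in each accepted read.

    Bases other than C or T are ignored here (an assumption).
    """
    counts = Counter(b.upper() for b in base_calls)
    return methylation_level(counts["C"], counts["T"], min_reads)
```

For instance, 3 reads with C and 7 reads with T at a position give a methylation level of 0.3.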

TABLE 4
BSPP Illumina sequencing statistics. This table contains statistics for the number of reads and number of matched reads for the Illumina runs used for the BSPP method. Each sample (row) corresponds to a single lane of sequencing.

Sample | Number of reads | Number matched (percentage) | Number accepted (percentage) | Number of probes with at least 1 read (percentage) | Number of probes with at least 10 reads (percentage)
GM06990 Tech rep 1 | 4,107,685 | 3,689,651 (89.8%) | 2,040,725 (49.7%) | 7,453 (78.0%) | 5,833 (61.1%)
GM06990 Tech rep 2 | 3,015,101 | 2,794,275 (92.7%) | 2,259,755 (74.9%) | 7,418 (77.7%) | 5,952 (62.3%)
PGP1 lymphocyte | 2,683,213 | 2,487,042 (92.7%) | 1,900,742 (70.8%) | 8,079 (84.6%) | 6,754 (70.7%)
PGP9 lymphocyte | 8,668,249 | 7,978,035 (92.0%) | 5,807,123 (67.0%) | 8,109 (84.9%) | 7,195 (75.3%)
PGP1 fibroblast | 1,468,378 | 1,364,329 (92.9%) | 1,101,446 (75.0%) | 7,131 (74.7%) | 5,384 (56.4%)
PGP9 fibroblast | 3,242,845 | 2,921,455 (90.1%) | 2,214,021 (68.3%) | 7,865 (82.3%) | 6,630 (69.4%)
PGP1 iPS | 283,724 | 247,492 (87.2%) | 193,621 (68.2%) | 6,566 (68.7%) | 3,942 (41.3%)
PGP9 iPS clone 1 | 528,421 | 478,597 (90.6%) | 369,861 (70.0%) | 7,061 (73.9%) | 4,790 (50.1%)
PGP9 iPS clone 2 | 8,973,759 | 8,281,843 (92.3%) | 5,800,724 (64.6%) | 8,507 (89.1%) | 7,606 (79.6%)

Comparison of GM06990 Methylation Levels and Gene Expression Levels

To compare the methylation data for the GM06990 cell line to gene expression levels, the Affymetrix PolyA+ RNA signal track was downloaded from UCSC (http://genome.ucsc.edu/ENCODE/). The data were examined for exons as annotated by RefGene, and the median value was determined as a record of gene expression level (for multiple possible transcripts, the transcript with the smallest difference between median values in the exons was taken). After excluding genes on the X chromosome, an expression ranking for 347 genes was obtained.

Each methylation data point was assigned position information according to its location relative to nearby genes. To create a profile examining methylation over an entire gene, sites within a gene were given a position value based on their relative position within the gene from transcriptional start to end (a fraction between 0 and 1). Sites upstream or downstream of genes were recorded according to number of bases from the gene boundaries. Genes were split into two groups based on expression level, and for each group the data was combined to create a methylation profile by calculating the running median (and quartiles) using a window of 0.1 inside the gene and 3000 base pairs outside.
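
The positional bookkeeping and running median described above can be sketched as follows. This Python sketch is illustrative only: the function names are hypothetical, the window is assumed to be centered on each evaluation point, and the choice of evaluation points is left to the caller because the step size is not stated.

```python
from statistics import median


def relative_position(site, tx_start, tx_end):
    """Fractional position of a site within a gene (0 = start, 1 = end).

    `tx_start` and `tx_end` are the transcriptional start and end, so the
    same formula works for genes on either strand.
    """
    return (site - tx_start) / (tx_end - tx_start)


def running_median(points, window, centers):
    """Median methylation of all points within a centered window.

    `points` is a list of (position, methylation) tuples; `window` is 0.1
    for fractional in-gene positions or 3000 for base-pair offsets outside
    genes. Returns (center, median) tuples, with None where no data fall
    inside the window.
    """
    half = window / 2.0
    profile = []
    for c in centers:
        values = [m for pos, m in points if abs(pos - c) <= half]
        profile.append((c, median(values) if values else None))
    return profile
```

Quartiles could be obtained the same way by replacing `median` with a quantile calculation over the values in each window.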

Comparison of GM06990 Methylation and Chromatin Immunoprecipitation (ChIP) Data

To compare cytosine methylation levels with histone modifications, the Sanger ChIP data for the GM06990 cell line was downloaded from the ENCODE project data at UCSC (http://genome.ucsc.edu/ENCODE/). The raw ChIP score was plotted for a given experiment against the methylation observed at that position to look for correlation(s) between cytosine methylation and histone modification types.

Rather than the expected correlation with H3K4 methylation, correlations were observed with histone modifications (H4ac, H3K27me3, H3K29me3) that correlated significantly with gene expression over the bodies of genes. Without intending to be bound by scientific theory, this is probably because few of the sites were close to the transcription start sites of genes (5.5% within 1 kb) but most were distributed over the bodies of genes (55.1%). To visualize the profiles of histone modification over the body of genes, the Sanger ChIP data was used to create running medians and quartiles for the high and low expression gene sets, analogous to those created for the cytosine methylation data.

PGP Cell Line Methylation Analysis

The methylation patterns of different cell lines were evaluated by comparing methylation levels at all locations for which data was obtained from both lines (FIG. 9). For the PGP cell lines, the gene expression data described herein was used, and the genes were ranked according to expression level. The genes were then split into two groups as before. Running median and quartile profiles were created as with the GM06990 data (FIG. 10).

It was observed that PGP9 B cells showed higher overall methylation compared to PGP1 B cells. Without intending to be bound by scientific theory, this may have been an artifact of cell culture differences (an increasing number of passages in cell culture of neural precursor cells has, for example, been observed to result in gradual hyper-methylation (Meissner, A. et al., Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454 (7205), 766 (2008))).

Methyl Sensitive Cut Counting (MSCC) Library Creation

Methyl sensitive cut counting uses a library created from all ends produced by HpaII digestion. To create the two adapters needed in the library construction, the following PAGE purified oligos were ordered from Integrated DNA Technologies (IDT).

AdA: CAAGCAGAAGACGGCATACGAAGAGTCTCTATATGCATCGATGCAGATCACGATCCGA
AdA_RC: CGTCGGATCGTGATCTGCATCGATGCATATAGAGACTCTTCGTATGCCGTCTTCTGCTTG
AdB: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNN
AdB_RC: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

These were resuspended with TE 8.0 to a concentration of 40 μM. Equal amounts of AdA and AdA_RC were mixed to form a 20 μM solution of adapter A, and equal amounts of AdB and AdB_RC were mixed to form a 20 μM solution of adapter B. To hybridize, these were placed in a PCR machine programmed to hold at 94° C. for 5 minutes, then drop 0.1° C. every 2 seconds, finally holding at 4° C. The resulting adapters had this design:

Adapter A
5′ CAAGCAGAAGACGGCATACGAAGAGTCTCTATATGCATCGATGCAGATCACGATCCGA 3′
3′ GTTCGTCTTCTGCCGTATGCTTCTCAGAGATATACGTAGCTACGTCTAGTGCTAGGCT GC 5′

Adapter B
5′   AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 3′
3′ NNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACATCTAGAGCCACCAGCGGCATAGTAA 5′

Adapter A contained: Solexa bridge PCR sequence (CAAGCAGAAGACGGCATACGA), 5′ CG overhang that matches HpaII cut ends (underlined GC), MmeI recognition site (bold text, complete after ligation). Adapter B contained: Solexa bridge PCR sequence (AGCGGCATAGTAA), Solexa genomic DNA sequencing primer (TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA), 5′ NN overhang that matches MmeI cut ends. Final library molecules will have a length of 137-138 base pairs and look like this:

5′ CAAGCAGAAGACGGCATACGAAGAGTCTCTATATGCATCGATGCAGATCACGATCCGACGGNNNNNNNNNNNNNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 3′
3′ GTTCGTCTTCTGCCGTATGCTTCTCAGAGATATACGTAGCTACGTCTAGTGCTAGGCTGCCNNNNNNNNNNNNNNNNNNTCTAGCCTTCTCGCAGCACATCCCTTTCTCACATCTAGAGCCACCAGCGGCATAGTAA 5′

To create an MSCC HpaII library, 2 μg of genomic DNA was used to assemble a 100 μl reaction with 20 units of HpaII (NEB) in 1× NEBuffer 1. This solution was incubated at 37° C. for 2 hours and then 65° C. for 20 minutes. 1.66 μl of 10 μM adapter A was added to the mixture, along with 12 μl 10 mM ATP (NEB) and 6 μl T4 DNA ligase (NEB). This concentration of adapter was estimated to be a 3.3-fold excess over cut ends if all possible HpaII targets were cut (but, because most sites are methylated, it should actually be in the range of 8-to-15-fold excess). The ligation mixture was incubated at 16° C. for 4 hours, then at 65° C. for 15 minutes. After ligation, ethanol precipitation was performed by adding 2 μl NF pellet paint (Novagen), 14 μl 3M sodium acetate, and 280 μl ethanol, and leaving the mixture overnight at −20° C. The precipitate was spun down, washed with 75% ethanol, and allowed to air dry. Because the adapters were not phosphorylated (preventing self ligation), only one of the two backbones of the adapters was ligated. To perform a nick repair, the pellets were resuspended to assemble a 50 μl reaction mixture with 8 units Bst DNA polymerase large fragment (NEB), 200 μM dNTP concentration and 1× thermopol buffer (NEB). This mixture was incubated at 50° C. for 20 minutes, then at 85° C. for 25 minutes. The nick-repaired fragments were precipitated by adding 6 μl 3M sodium acetate and 120 μl ethanol and storing at −20° C. for several hours. MmeI digestion was then performed by resuspending the pellets into a 50 μl reaction mixture containing 2 units of MmeI (NEB), 50 μM SAM (NEB) and 1× NEBuffer 4. This was incubated at 37° C. for 2 hours, then 80° C. for 20 minutes. 1.66 μl of 10 μM adapter B was added to the mixture, along with 6.1 μl 10 mM ATP (NEB) and 3 μl T4 DNA ligase (NEB), and the mixture was then incubated at 16° C. for 4 hours, and 65° C. for 15 minutes.

To purify the target product before PCR amplification, the resulting unpurified reaction mixture was run on a 6% non-denaturing TBE polyacrylamide gel (NOVEX), alongside a NEB Low Molecular Weight DNA ladder. This size range was cut from the gel using SYBR Gold stain (Invitrogen) and Dark Reader (Clare Chemical) to avoid UV exposure. DNA was eluted from the polyacrylamide gel by incubating gel fragments in 2× volume of TE 8.0 and shaking at 60° C. for 30 minutes. A 0.2 μm Nanosep spin column (Pall) was used to remove gel fragments, ethanol precipitation (with 2 μl NF Pellet Paint) was used to recover the DNA from the solution, and the DNA was finally resuspended in 30 μl TE 8.0.

Because the concentration of this library could have been too low for sequencing, 25 μl of the sample was amplified by assembling a 100 μl PCR mixture containing 500 nM each of primers (SLXPCR-AdA: CAAGCAGAAGACGGCATACGA and SLXPCR-AdB: AATGATACGGCGACCACCGAG), 200 μM dNTPs, 1× HF buffer and 2 units iProof (Bio-Rad). This mixture was then denatured at 98° C. for 30 seconds, cycled 8 times at 98° C. for 10 seconds, 67° C. for 15 seconds, and 72° C. for 15 seconds, then held at 72° C. for 5 minutes. The resulting product was purified with the QiaQuick PCR clean-up kit (Qiagen), eluting with 30 μl EB. Sample concentration was measured with a Nanodrop, then diluted to 840 pg/μl (10 nM) for Illumina/Solexa sequencing.

The MspI control library was constructed in the same manner as the HpaII library, with the following changes: (i) in the first step 40 units of MspI (NEB) were used in place of HpaII and NEBuffer 2 was used instead of NEBuffer 1; and (ii) no amplification was done after gel purification.

The inverse library was constructed in this manner: HpaII digestion was performed as done in the HpaII library. After this, 10 units Antarctic Phosphatase (NEB) and 11 μl 10× Antarctic Phosphatase Buffer (NEB) were added to the mixture, which was then incubated at 37° C. for 1 hour, and 65° C. for 15 minutes. DNA was purified with phenol:chloroform followed by ethanol precipitation. The DNA was then resuspended and treated in the same manner as the MspI control library.

MSCC Sequencing and Placement of Reads

In total, three lanes of sequencing were performed, two lanes for technical replicate 1 and one lane for replicate 2 (Table 5).

TABLE 5
MSCC Illumina sequencing statistics. This table contains statistics for the number of reads and number of matched reads for the Illumina runs used for the MSCC method. Each row corresponds to a single lane of sequencing.

Sample | Number of reads | Number matched (percentage) | Number accepted (percentage) | Percentage of tags seen at least once | Average number of reads per tag
PGP1L HpaII Tech rep 1, round 1 | 6,052,886 | 3,598,311 (59.4%) | 1,765,709 (29.2%) | 38.0% | 0.77
PGP1L HpaII Tech rep 1, round 2 | 5,759,738 | 4,233,294 (73.5%) | 2,303,336 (40.0%) | 43.4% | 1.0
PGP1L HpaII Tech rep 2 | 8,579,795 | 6,397,139 (74.6%) | 3,536,353 (41.2%) | 53.6% | 1.5
PGP1L HpaII total | 20,392,419 | 14,228,744 (69.8%) | 7,605,398 (37.3%) | 65.7% | 3.3
PGP1L MspI Control | 10,423,134 | 8,682,641 (83.3%) | 4,319,599 (41.4%) | 76.0% | 1.9
PGP1L Inverse library | 6,355,775 | 4,954,057 (77.9%) | 2,172,381 (34.2%) | 45.3% | 0.94

Using the human genome reference sequence, a list of all possible HpaII sites was created. Each CCGG site present generated two possible tags, from upstream and downstream sequence. Statistics for possible HpaII sites, and other enzymes, are available in Table 6. Reads were matched (using an in-house Perl program) if they were within two bases of an expected tag. A read was then accepted if it met two criteria: (a) it was either exact and there were no single mismatch possibilities, or it had a single mismatch and there were no double mismatch possibilities, and (b) it matched a subset of “unique” positions that were more than one base different in sequence from all other tags (1.4 million locations). Note that the amount of information and accuracy obtained for each site was not related to “coverage” of the site. Sites with no reads were inferred to be highly methylated (and hence uncut and unrepresented in the library). Sites with more reads were inferred to have lower methylation. Read count information for all unique HpaII sites is set forth in Table 7.
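
The two acceptance criteria above can be illustrated with a small Python sketch. This is not the in-house Perl program; the function names are hypothetical, reads and tags are assumed to have equal length, and the linear scan over all tags is used only for clarity (a genome-scale implementation would need an index).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def place_read(read, tags, unique_tags):
    """Return the tag a read is assigned to, or None if it is rejected.

    Criterion (a): an exact match with no tag one mismatch away, or a
    single-mismatch match with no tag two mismatches away.
    Criterion (b): the matched tag belongs to the "unique" subset.
    """
    by_distance = {0: [], 1: [], 2: []}
    for tag in tags:
        if len(tag) != len(read):
            continue  # simplifying assumption: only compare equal lengths
        d = hamming(read, tag)
        if d <= 2:
            by_distance[d].append(tag)

    if len(by_distance[0]) == 1 and not by_distance[1]:
        hit = by_distance[0][0]
    elif (not by_distance[0] and len(by_distance[1]) == 1
          and not by_distance[2]):
        hit = by_distance[1][0]
    else:
        return None
    return hit if hit in unique_tags else None
```

A read one mismatch away from a unique tag is accepted only when no other tag lies within two mismatches of it.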

TABLE 6
Methylation sensitive enzymes and their site frequencies. This table contains some statistics for HpaII and some other methylation-sensitive enzymes: the number of sites and number of “unique” sites that could be profiled by MSCC. Based on the March 2006 human reference sequence (NCBI Build 36.1) downloaded from UCSC. Unique sites were based on tags created with MmeI (18 or 19 bases of sequence) and were required to be at least two bases different from all other possible tag sequences. Numbers of unique sites for enzymes other than HpaII were estimates based on analysis of a random set of 10,000 locations.

Restriction enzyme | Recognition site | Number of sites in human genome | Number of “unique” sites
HpaII | CCGG | 2,321,216 | 1,417,432 (61.1%)
HhaI, CfoI | GCGC | 1,674,129 | ~950,000 (60%)
AciI | CCGC | 4,153,824 | ~2,500,000 (60%)
HpyCH4IV, MaeII | ACGT | 2,167,347 | ~1,500,000 (70%)
BstUI, MvnI | CGCG | 693,643 | ~420,000 (60%)

TABLE 7
MSCC data (separate file) and description of columns. This table contains the locations and read count data for all unique HpaII sites profiled with MSCC. Each site could produce two possible tags: “strand” refers to the two strands based on whether they are generated from upstream (minus) or downstream (plus) sequence. Although the read counts are separated here, the MSCC data analysis used the sum of columns 4, 5 and 6.

Column | Description
1 | chromosome
2 | location
3 | strand
4 | HpaII Technical replicate 1, sequencing lane 1
5 | HpaII Technical replicate 1, sequencing lane 2
6 | HpaII Technical replicate 2
7 | MspI control
8 | Inverse library

Based on a median over-dispersion of 62% compared to Poisson standard deviations when data was binned, HpaII tag counts were modeled as arising from a gamma-Poisson distribution, and 1, 3 and 8 lanes of read count data were simulated (FIG. 17). Paired tags were then modeled as a sum of two independent numbers generated from the same distribution. Using this model, given 1, 3 and 8 lanes of sequencing data, the probabilities (shown in Table 8 with unit %) of observing more counts at one paired tag site compared to another paired tag site with higher underlying methylation level were estimated (note that the count number is anti-correlated with methylation level).
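
A simulation of this kind can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, the gamma shape parameter is left as a free input because the mapping from the 62% over-dispersion figure to model parameters is not given here, and the per-tag means are placeholders rather than values from the data.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_tag_count(mean, shape, rng):
    """Gamma-Poisson draw: a gamma-distributed rate, then a Poisson count."""
    if mean <= 0:
        return 0
    rate = rng.gammavariate(shape, mean / shape)  # gamma with the given mean
    return sample_poisson(rate, rng)


def sample_paired_count(mean, shape, rng):
    """A paired tag site is the sum of two independent tag counts."""
    return sample_tag_count(mean, shape, rng) + sample_tag_count(mean, shape, rng)


def prob_more_counts(mean_low_meth, mean_high_meth, shape,
                     trials=20000, seed=0):
    """Estimate P(counts at the less methylated site > counts at the more
    methylated site) by Monte Carlo simulation."""
    rng = random.Random(seed)
    wins = sum(
        sample_paired_count(mean_low_meth, shape, rng)
        > sample_paired_count(mean_high_meth, shape, rng)
        for _ in range(trials)
    )
    return wins / trials
```

Adding lanes of sequencing corresponds to scaling both per-tag means by the same factor, which increases the estimated probability for a fixed methylation difference.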

Comparison of Methylation Level Determined by MSCC or BSPP

Because the same sample (PGP1 lymphocyte) was analyzed with the BSPP ENCODE set, these data sets overlap at 381 sites (345 of these have two unique tags, and so there is data for a total of 726 tags). The raw data was compared as the number of counts (MSCC combined data) vs. methylation level (BSPP data) for each point. It was expected, when data was averaged over many points, that a linear correlation between average number of counts and the level of methylation would be observed. For example, a site that was completely cut (0% methylation) should produce twice as many library molecules (and, on average, have twice as many observations) compared to a site that was 50% cut (50% methylation). Sites that were completely methylated were expected to have zero observations. The expected relationship between methylation level and the average counts observed for that methylation level was: methylation=1−C*average counts.

The 726 data points were divided into 22 equally sized groups of 33 data points. For each, the average methylation was calculated and the averages were used to determine the best fit value for C: 0.1128 (completely unmethylated sites have an average of 8.9 observations). The standard deviation expected for a Poisson distribution with a lambda predicted by the linear equation was also determined and included (FIG. 3B). Thus, the equation that was later used to relate average counts to methylation was: methylation=1−0.1128*average counts.
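
The linear relationship and the fit for C can be illustrated with a brief Python sketch. The function names are hypothetical, and the least-squares fit with the intercept fixed at 1 is an assumption; the fitting procedure actually used is not specified above.

```python
C_FIT = 0.1128  # best fit value reported above


def methylation_from_counts(avg_counts, c=C_FIT):
    """methylation = 1 - C * average counts."""
    return 1.0 - c * avg_counts


def fit_c(avg_counts, avg_methylation):
    """Least-squares estimate of C with the intercept fixed at 1.

    Minimizes the sum of (m_i - (1 - C * x_i))^2 over the binned averages,
    which gives C = sum(x_i * (1 - m_i)) / sum(x_i^2).
    """
    numerator = sum(x * (1.0 - m) for x, m in zip(avg_counts, avg_methylation))
    denominator = sum(x * x for x in avg_counts)
    return numerator / denominator
```

With C = 0.1128 the estimate reaches zero methylation at 1/0.1128, or about 8.9, average counts, consistent with the figure quoted above; the equation is intended for counts averaged over many sites, and larger values would give negative estimates.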

Comparison of MSCC Methylation with Gene Expression Levels

Gene positions were calculated based on the RefGene list downloaded from UCSC (for genes with multiple possible starts/ends, only the first entry was used). Using the same expression data as used earlier for the PGP1 lymphocyte cell line, a list of 17,546 genes was split into five equally sized groups based on their gene expression levels.

As with the BSPP data, the position of each CpG was recorded relative to the gene start, the gene end, and the fraction within the gene. A running average of counts was created for each. FIG. 2B used a window of 5000 data points, and FIGS. 2C-2F used a window of 2000 data points or 200 base pairs (the larger of the two; for each step an average position as well as average methylation was taken for all the data points). For FIG. 2B, upstream and downstream methylation were combined with fractional gene position to create an “average gene.” For FIGS. 2C and 2D, positions relative to 5′ and 3′ ends of the gene were used, respectively, to create average methylation profiles for each end of genes. For FIGS. 2E and 2F, the MSCC data was split into two groups based on whether the site profiled was within a CpG island (based on UCSC's CpG island annotation). The data from these two groups was then used to generate profiles of promoter methylation based only on that subset of data (inside CpG islands and outside CpG islands, respectively). Counts were normalized for local CpG density (surrounding 200 base pairs), for MspI control library counts, and, for the in-gene data in FIG. 2B, for gene length.

MSCC Methylation Profiles for Individual Genes

Based on the analyses presented herein (mainly FIG. 2), it was decided that there were two methylation measurements that might be relevant to classifying genes according to epigenetic state: promoter methylation and gene body methylation. Based on FIG. 2C, the promoter region was defined as spanning −400 base pairs to +1000 base pairs relative to transcription start. Based on FIGS. 2C and 2D, the gene body region was defined as the region between +3000 base pairs relative to transcription start and the end of the gene. For each gene, all the data points from the promoter and gene body regions were gathered and, using the genes that had at least ten data points in each region, each gene was plotted according to the average counts for each region (FIG. 13, x axis=gene promoter, y axis=gene body, color=gene expression rank). Because a larger difference was observed, on average, in promoter methylation relative to gene expression for data points arising from outside CpG islands, the same promoter vs. gene body methylation graph was generated using only data points from outside CpG islands (FIG. 3A). For this plot a minimum of five data points were required for each region.
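
The region definitions above translate directly into coordinate tests. The Python sketch below is illustrative: the function names, the strand argument, and the use of `None` for sites outside both regions or for regions with too few data points are assumptions.

```python
def classify_site(pos, tx_start, tx_end, strand="+"):
    """Assign a site to "promoter", "gene_body", or None.

    `tx_start` is the transcription start and `tx_end` the end of the gene,
    so tx_start > tx_end for minus-strand genes. Offsets are measured in the
    direction of transcription.
    """
    sign = 1 if strand == "+" else -1
    offset = sign * (pos - tx_start)
    gene_length = sign * (tx_end - tx_start)
    if -400 <= offset <= 1000:
        return "promoter"
    if 3000 <= offset <= gene_length:
        return "gene_body"
    return None


def region_average(counts, min_points=10):
    """Average counts for a region, or None if there are too few data points."""
    if len(counts) < min_points:
        return None
    return sum(counts) / len(counts)
```

The `min_points` argument would be set to 10 for the plot using all data points, 5 for the plot restricted to sites outside CpG islands, and 50 for the gene body histogram.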

To create a histogram of average gene body methylation for individual genes, the same gene body methylation averages were used, but the set was restricted to genes with 50 or more data points within that region (FIG. 3B).
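A minimal sketch of this filtering and binning step is given below. The threshold of 50 data points is taken from the text; the function name `body_histogram`, the input layout, and the bin width are hypothetical, since the binning used for FIG. 3B is not specified here.

```python
from collections import Counter


def body_histogram(gene_body_counts, min_points=50, bin_width=1.0):
    """Histogram of per-gene average gene body counts.

    Hypothetical helper. `gene_body_counts` maps a gene name to the list
    of counts observed in its gene body region. Genes with fewer than
    `min_points` data points are skipped. Returns {bin_start: n_genes}.
    """
    hist = Counter()
    for counts in gene_body_counts.values():
        if len(counts) < min_points:
            continue  # too few data points for a reliable average
        avg = sum(counts) / len(counts)
        bin_start = int(avg // bin_width) * bin_width
        hist[bin_start] += 1
    return dict(sorted(hist.items()))
```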

1. A method for determining a methylated cytosine profile of a target nucleic acid sequence comprising the steps of: providing a sample of nucleic acid sequences; contacting the sample with a chemical agent that converts unmethylated cytosine residues in the nucleic acid sequences to uracil residues; contacting the sample with a plurality of nucleic acid probes, wherein the probes are designed to hybridize randomly along a target nucleic acid sequence; allowing hybridization of the plurality of nucleic acid probes to the target nucleic acid sequence; forming a plurality of circular nucleic acid sequences, each of the circular sequences comprising a nucleic acid probe sequence and a target nucleic acid sequence; amplifying the plurality of circular nucleic acid sequences to form a plurality of amplified target nucleic acid sequences; and sequencing the amplified target nucleic acid sequences.
2. The method of claim 1, wherein the chemical agent is bisulfite.
3. The method of claim 1, wherein the probes are designed to hybridize to promoter regions along a target nucleic acid sequence.
4. The method of claim 1, wherein amplification primers hybridize to nucleic acid probe sequences during the step of amplifying.
5. The method of claim 1, wherein the nucleic acid probes are padlock probes.
6. The method of claim 1, wherein the target nucleic acid sequence is genomic DNA.
7. The method of claim 6, wherein the genomic DNA is whole genome DNA.
8. The method of claim 1, wherein the target nucleic acid sequence is a gene.
9. The method of claim 1, wherein the target nucleic acid sequence is a promoter region.
10. A method for determining a methylated cytosine profile of a target nucleic acid sequence comprising the steps of: providing a sample of nucleic acid sequences; cleaving the nucleic acid sequences in a methylation-dependent manner to generate a plurality of cleaved target nucleic acid sequences; ligating first adapter sequence tags to the 5′ ends of cleaved target nucleic acid sequences and second adapter sequence tags to the 3′ ends of the cleaved target nucleic acid sequences; amplifying the cleaved target nucleic acid sequences having first and second adapter sequence tags ligated thereto; and sequencing the amplified, cleaved target nucleic acid sequences.
11. The method of claim 10, wherein the step of cleaving the nucleic acid sequences in a methylation-dependent manner comprises contacting the nucleic acid sequences with a methyl sensitive restriction enzyme to cleave unmethylated CpG dinucleotide sequences.
12. The method of claim 10, wherein amplification primers hybridize to the first or the second adapter sequence tags during the step of amplifying.
13. The method of claim 10, wherein the target nucleic acid sequence is genomic DNA.
14. The method of claim 13, wherein the genomic DNA is whole genome DNA.
15. The method of claim 10, wherein the target nucleic acid sequence is a gene.
16. The method of claim 10, wherein the target nucleic acid sequence is a promoter region.
17. The method of claim 10, further comprising the step of comparing the methylated cytosine profile of the target nucleic acid sequence to a methylated cytosine profile of a control library.
18. The method of claim 17, wherein the control library is generated by contacting a target nucleic acid sequence with a methylation-insensitive enzyme.
19. The method of claim 18, wherein the methylation-insensitive enzyme is MspI.
20. A method for determining a complementary methylated cytosine library of a target nucleic acid sequence comprising the steps of: providing a sample of nucleic acid sequences; cleaving the nucleic acid sequences in a methylation-dependent manner to generate a plurality of cleaved target nucleic acid sequences; blocking the ends of the cleaved target nucleic acid sequences to prevent the cleaved target nucleic acid sequences from contributing to library construction; and contacting the blocked, cleaved target nucleic acid sequences with a methylation-insensitive enzyme to create a complementary methylated cytosine library that comprises a plurality of nucleic acid sequences that were not cleaved in a methylation-dependent manner.
21. The method of claim 20, wherein the blocking step comprises dephosphorylating the 5′ ends of the cleaved target nucleic acid sequences.